SUMMARY


This is a final report on the tasks supported by NASA Langley Research Center under 
Grant NAG 1-756, Computational Methods and Software Systems for Dynamics 
and Control of Large Space Structures. The report covers progress to date, projected 
developments in the final months of the grant and conclusions. Pertinent reports and 
papers that have not appeared in scientific journals (or have not yet appeared in final 
form) are enclosed. 

The grant has supported research in two key areas of crucial importance to the 
computer-based simulation of large space structures. The first area involves multibody
dynamics (MBD) of flexible space structures, with applications directed to deployment, 
construction and maneuvering. The second area deals with advanced software systems, 
with emphasis on parallel processing. The latest research thrust in the second area, as 
reported here, involves massively parallel computers.

Task 1: MULTIBODY DYNAMICS


Background 

This is a continuing research task that began in June 1986 and has progressed steadily 
over the past three years. The work has emphasized the following research components: 

(1) Formulation of flexible multibody dynamics in a computationally oriented context. 

(2) Formulation, implementation and evaluation of flexible three-dimensional beam el- 
ements capable of arbitrary motions and implementable in energy-conserving time 
integration methods. 

(3) Development of a library of joint constraints to connect beam elements. 

(4) Development, formulation and evaluation of energy-conserving time integration proce- 
dures, with emphasis on explicit-implicit partitioned solution algorithms for treating 
translational, rotational and constraint degrees of freedom in a staggered manner. 

(5) Parallel implementation of multibody dynamics, including interconnection topology 
analysis and direct time integration. 

(6) Completion of joint constraint library with contact-impact effects. 

Over the past year work has concentrated on areas (4), (5) and (6). The principal investi- 
gator in areas (1) through (5) is Professor K. C. Park, whereas area (6) is jointly supervised 
by Professors Park and Felippa. Three doctoral students have carried out research in these 
areas: Jin-Chern Chiou (fully supported by this grant), Janice Downer (supported by a 
NASA fellowship) and Horacio de la Fuente (partly supported by this grant). 

Following is a summary of accomplishments in areas (4) through (6), which are treated 
more fully in the enclosed reports (References 1-5). 

Staggered Solution Procedures for MBD 

An efficient staggered solution procedure for treating MBD systems has been developed, 
tested and implemented. The MBD equations of motion are partitioned so that the con- 
straint forces appear as independent variables that can be integrated in time, separately 
from the mechanical variables. The latter are in turn partitioned into translational and 
rotational variables. The resulting partitioned equations of motion are integrated by a 
two-stage stabilized algorithm for updating both the translational coordinates and the 
angular velocities. Details of this procedure are given in Ref. 1, included in this report. 
The application of this procedure to the simulation of flexible MBD systems composed of
three-dimensional beams is described in Ref. 2, which is also included in this report. 

MBD Topology Analysis for Parallel Implementation

A parallel partitioning scheme based on physical coordinate variables was developed to
eliminate constraint forces and yield the MBD equations of motion in terms of independent
coordinates. This scheme features an explicit determination of independent coordinates
and the parallel computation of the null space of the constraint Jacobian matrix. This
work is described in Ref. 3, which is included in the present report. 

Parallel Direct Time Integration of MBD 

Using the topological analysis developed under the previous task, a two-stage staggered 
algorithm for parallel computations has been developed, implemented and tested on a 
shared-memory parallel computer. The solution scheme features a new Schur-complement- 
based parallel preconditioned CG algorithm. This solution scheme is a “spin out”. This 
work is described in Ref. 3. 

Development of Contact-Impact Algorithms 

This task began in May 1990 because of delayed funding and is in progress at the time of 
writing. Contact-impact is represented by a fictitious, time-varying penalty spring that is
designed to absorb the impulse of the contacting bodies in the form of a “penalty spring
energy”. This energy is released totally or partially on separation (partial release is used
to model dissipation effects) and eventually the spring disappears. This new technique
offers implementation advantages: it can be easily accommodated in a variable-step
explicit time integration scheme, and it appears well suited to implementation on massively
parallel computers. Preliminary results on simple impact problems are encouraging as
regards both general physical behavior and energy-conserving characteristics.

Task 2: FINITE ELEMENT COMPUTATIONS ON 
A MASSIVELY PARALLEL COMPUTER 


Background 

This task represents the final phase of the software systems thrust. It was started in 
July 1989. The principal investigators are Professors C. Farhat and C.A. Felippa. Post-doc
Research Associate E. Pramono has worked full-time on this project, which
has also supported a graduate student (L. Crivelli) half-time. The main objective is the 
evaluation of the suitability of the Connection-Machine 2 (CM-2), a massively parallel 
computer, for large-scale finite element computations with emphasis on static analysis. 

This work did not begin from scratch, but has substantially benefited from prior efforts. 
Investigation of the potential of the CM-2 for explicit dynamic calculations began in 1987 
under NRL funding. This work involved Professors Farhat and K.C. Park, and post-doc
Research Associate N. Sobh. Work was carried out on the CM-2 computer at NRL, which
is a half configuration of 32768 processors. The results of this study are presented in an
enclosed report (Ref. 6), which is to appear shortly in the International Journal for Numerical
Methods in Engineering. Portions of that work have appeared in Ref. 7, which is also 
included. 

In 1988 DARPA donated a small CM-2 (8192 processors) to the University of Colorado. 
The machine is presently installed at the National Center for Atmospheric Research 
(NCAR) and connected to the Campus Unix network. Although only one eighth of a full 
configuration, the increased availability and our deployment of real-time on-line graphics 
have substantially improved our ability to develop and test software. The CM-2 is not 
an easy machine to program because of its unconventional nature and the initial support 
of only two major programming languages with parallel constructions: CM-Lisp and C*. 
Virtually all programming has been done in C*, which is an object-oriented superset of C
and C++.

In August 1989 Dr. Sobh left us to take a faculty position at Old Dominion University. 
Dr. Pramono, whose prior experience in parallel processing had been on shared memory 
machines (especially Cray 2, Alliant and Convex using the Force Preprocessor) had to take 
over and gradually became an expert on the Connection Machine over the past six months. 

Progress 

Our work on the CM-2 to date has concentrated on the following software modules. 

Decomposer. A general-purpose finite element model decomposer, described in Ref. 3, 
that takes as input an arbitrary mesh description, and produces a set of finite element 
data structures that can be loaded within one generic CM-2 chip. 

Mapper. A general purpose mapper that assigns each of the data structures produced by 
the decomposer to a well defined chip. The goal of this allocation strategy is to reduce the 
distance that information has to travel between neighboring finite elements. 

Residual Evaluator. This is a computational kernel that controls the direct calculation of 
element residuals, where “direct” means that no element stiffness matrices are evaluated. 
This kernel interacts with both a transient dynamics algorithm based on Central Difference
and an iterative solver based on Jacobi-Preconditioned Conjugate Gradients.

Element Library. This includes a 3D 2-node truss, a 3D 2-node beam, a 3D 8-node brick, 
a 2D 4-node quadrilateral and a 4-node ANS shell element. The shell element has been 
the latest one incorporated in this library, and testing was completed during May 1990.

Visualization. A parallel visualization kernel that operates in real time and displays
both wire-frame representations of the initial and deformed mesh and shaded contour-value
finite element plots as they are being computed.

Parallel I/O Manager. A kernel used to archive the computed results on the CM-2 data 
vault. It is based on the Parallel I/O Manager written by E. Pramono and described in 
Ref. 8. 

These software modules together comprise a massively parallel prototype finite element 
code that effectively embeds MIMD computations on a SIMD hardware architecture. 
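
As an illustration of the Residual Evaluator idea, the following minimal Python sketch applies a Jacobi-preconditioned conjugate gradient iteration to a one-dimensional chain of bar elements fixed at one end. It is not the C* code of the prototype and is not taken from the referenced reports; the function names, the serial loops and the model problem are assumptions made for illustration only. The product of the stiffness matrix with a vector is accumulated element by element, so no element or global stiffness matrix is ever formed.

```python
"""Matrix-free Jacobi-preconditioned CG for a fixed-free chain of bar elements.

Illustrative sketch only: serial Python, hypothetical function names.
"""


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def apply_stiffness(u, k_elem):
    """Return K*u by summing element forces; no stiffness matrix is stored.

    Node 0 is fixed, so u[j] is the displacement of node j + 1.
    Element e joins node e to node e + 1 and has axial stiffness k_elem[e].
    """
    f = [0.0] * len(u)
    for e, k in enumerate(k_elem):
        left = u[e - 1] if e > 0 else 0.0   # fixed support at node 0
        force = k * (u[e] - left)           # axial force carried by element e
        f[e] += force
        if e > 0:
            f[e - 1] -= force
    return f


def jacobi_pcg(k_elem, load, tol=1e-10, max_iter=200):
    """Solve K u = load with diagonally (Jacobi) preconditioned CG."""
    n = len(load)
    # Diagonal of K: each free node collects the stiffness of its adjacent elements.
    diag = [k_elem[j] + (k_elem[j + 1] if j + 1 < n else 0.0) for j in range(n)]
    u = [0.0] * n
    r = list(load)                          # residual for the zero initial guess
    z = [ri / di for ri, di in zip(r, diag)]
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            return u, iteration
        kp = apply_stiffness(p, k_elem)
        alpha = rz / dot(p, kp)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * kpi for ri, kpi in zip(r, kp)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return u, max_iter


if __name__ == "__main__":
    # Four elements in series with a unit load at the free end.
    displacements, iterations = jacobi_pcg([2.0, 1.0, 4.0, 1.0], [0.0, 0.0, 0.0, 1.0])
    print(displacements, iterations)
```

For a unit tip load the exact displacement of each node is the sum of the flexibilities 1/k of the elements between it and the support, which gives a simple check on the iteration. On a machine such as the CM-2 the element loop would be distributed over processors, and the accumulation at shared nodes is where the communication cost discussed in this report would arise.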

Conclusions 

Preliminary results using the prototype code with emphasis on truss and frame structures 
are reported in References 6 through 10. In general, it has been found that this highly par- 
allel processor can outperform vector supercomputers such as the Cray family on explicit 
computations but not on implicit ones. 

Several features distinguish the CM-2 from earlier SIMD hypercubes. On the hardware side 
we note the impressive number-crunching power and the fast parallel I/O capabilities.
On the software side we note the virtual processor concept, which may be viewed as the 
dual of the better known virtual memory concept. 

Mesh decomposition and processor-to-element mapping are the fundamental software mod- 
ules that hold the key to massively parallel finite element computations. A given mesh is 
partitioned into 16-element subdomains that correspond to the 16-processor chips of the
CM-2. This partitioning is carried out in a way that minimizes the number of nodes at 
the interface between the subdomains. As a result, only those processors that are mapped 
onto finite elements at the subdomain boundary communicate with processors packaged 
on different chips. Moreover, this partitioning is such that the bandwidth of the resulting 
subdomain is large enough to allow efficient use of the 12 interchip wires. 

The mapping algorithm attempts to reduce the distance information has to travel over
the communication network. It searches iteratively for an optimal mapping through a 
2-step minimization of the communication costs associated with candidate mappings. 

The following is a summary of the key conclusions reported in the referenced papers. 

(1) The current CM-2 processor memory size of 64 Kbits penalizes high order elements 
in the sense that only small VP (virtual processor) ratios can be achieved. Thus the 
current configuration favors simpler elements. (This restriction should disappear in
the CM-3 model, which will have 1 Mbit of memory per processor and an aggregate
computing power of over 1000 Gflops.) 

(2) Three-dimensional and higher-order finite elements induce longer communication 
times. 

(3) Mesh irregularities slow down the computation speed in various ways. 

(4) The Data Vault is very effective at reducing I/O time. 

(5) The Frame Buffer is ideal for real-time visualization. 

(6) The Virtual Processor concept outperforms substructuring. 

Ongoing Work 

We have found that the CM-2 can outperform the Cray-2 on explicit calculations for which 
sustained rates over 1 Gigaflop are possible. Given the intrinsic scalability of the massively 
parallel architecture (for example, the 1 Teraflop CM-3 under development for DARPA) 
there is little question as to the future potential for that class of computations which arise 
naturally in dynamic simulations. Projections are for 100-1000 times what the fastest Cray 
can achieve. 

On the other hand, implicit calculations arise naturally in the solutions of static problems. 
This class of calculation places a higher burden on communication, which has a detrimental 
effect on performance. For such algorithms the vector supercomputers still outperform the 
CM-2. Semi-iterative methods such as the conventional Conjugate Gradient (CG) also 
suffer to some degree from communications overhead since information has to be gathered 
from shared finite element nodes in residual calculations. 

Over the past six months, Professor Farhat in collaboration with Dr. Roux of ONERA 
(France) has developed an unconventional form of the CG algorithm called the “hybrid” or 
“tearing” method. The primary objective in this development is to reduce communication 
overhead on local memory parallel computers. A secondary objective is to reduce the 
number of iterations for convergence. The present version of the algorithm is described in
some detail in Refs. 11, 12 and 13. The initial version was coded in Fortran augmented 
with the Force preprocessor and tested on the Cray YMP. These tests provided confidence 
in the convergence characteristics on static problems involving up to 48,000 equations. A 
subsequent version was ported to the Los Alamos iPSC Hypercube, on which the reduced 
communication overhead was verified. As final tests, we plan to recode the algorithm in 
C* for the CM-2 and compare with the conventional CG implementation. Because of the 
local memory limitations, however, the domain decomposition on the CM-2 is done at the 
element level.

Final Benchmarking Work

During the period of March to date (July 1990) we have benchmarked large-scale static 
problems on the CM-2 versus the Cray 2 and Cray YMP. The results are being analyzed 
at the time of writing and will subsequently be reported in the literature.

Staggered Solution Procedures for
Multibody Dynamics Simulation 
K. C. Park, J. C. Chiou, and J. D. Downer 
Department of Aerospace Engineering Sciences 
and Center for Space Structures and Controls 
University of Colorado at Boulder 
Boulder, CO 80309-0429, USA 


I. Introduction 

Simulation of multibody dynamics systems, such as robotic manipulators, automobile
maneuvering and satellite deployment, remains a challenge to the dynamist due to
its increasing role in design improvements, control and safe operation. Because of
substantial progress made during the past three decades in formulation [1-19], constraint
treatment and solution techniques [21-36] and the availability of multibody dynamics simulation
packages [37-42], it has now become almost routine practice to perform realistic modeling
and assessment of some practical problems, such as mechanical linkages and manipulations
of robotic arms, if the multibody components consist mostly of rigid bodies, discrete springs and
dampers (see, e.g., Haug [15]). However, substantial advances in modeling, formulation and
computational methods are necessary in order to develop a real-time simulation capability
for ground vehicle maneuvering dynamics, robotic manipulations and space structures
deployment/assembly.

Specifically, improved modeling of flexibility for localized motions and geometric
nonlinearities, material nonlinearities and contact/friction phenomena, robust and accurate
treatment of the system constraint conditions and efficient use of emerging computer hard- 
ware/software technology continue to offer intense research opportunities. Thus, the de- 
velopment of a real-time multibody dynamics simulation capability requires a concerted 
integration of various modeling, formulation and computational aspects. These include: 
selection of a data structure for describing the system topology, computerized generation 
of the governing equations of motion, implementation of suitable solution algorithms, in- 
corporation of constraint conditions and easy interpretation of the simulation results. Of 
these, this chapter is concerned with three computational aspects of multibody dynamics
simulation: direct time integration of the governing equations of motion, stabilization of
the constraint solution process and the computer implementation of both.

From the computational viewpoint, multibody dynamics (MBD) problems are distinct
from structural dynamics problems in that the solution of MBD problems must also
satisfy, at each time integration step, the attendant kinematic and equilibrium constraints.
This has motivated many dynamists to develop various techniques, in addition to direct in- 
tegration algorithms, for accurately and efficiently handling the system constraints. Hence, 
reliability and cost of existing MBD simulation packages have been strongly affected by 
how efficiently and accurately the constraints are preserved during the numerical solution 
stage. 

In general, there have been two types of direct time integration algorithms for the
transient response analysis of dynamical systems: explicit and implicit algorithms (see,
e.g., Hughes and Belytschko [43], Park [44] and Belytschko, Englemann and Liu [45]). Currently,
implicit algorithms appear to be favored by many MBD specialists when both the generalized
coordinates and the constraint forces are treated as the unknowns. In this case,
the corresponding formulations incorporate the system constraints by the Lagrange multiplier
method. It is well known that the resulting Newton-like solution matrix is
stiff. This has led to implicit time discretization of the constraint-augmented equations
and simultaneous solution of both the generalized coordinates and the Lagrange multipliers.
This approach has been extensively investigated by Gear [21], Baumgarte [22,29], Orlandea,
Chase and Calahan [23], Petzold [27] and Nikravesh [31], among others. Because these methods solve
both the generalized coordinates and the constraint forces simultaneously, they will be
called the simultaneous solution methods in this chapter.

On the other hand, if the constraints are eliminated so as to reduce the number of
unknowns, it is possible to employ either an implicit or an explicit algorithm. For this
situation, one may invoke either a geometric or an algebraic procedure to streamline the resulting
equations of motion if the system topology is an open tree. In essence, geometric
procedures have utilized an open-tree topology, such as the use of the incidence matrix by
Wittenburg [10] and the body array matrix by Huston [19]. Some of the proposed algebraic
procedures include the singular decomposition by Walton et al. [20], the use of the generalized
speed of Kane and Levinson [20], the coordinate partitioning technique by Wehage and
Haug [28], the selection of independent coordinates through the natural-coordinate formulation
of Garcia de Jalon et al. [33] and the so-called order-N procedures of Armstrong [11],
Hollerbach [12], Schwertassek and Roberson [17] and Orin et al. [25], among others.

As the complexity of MBD systems increases, the simultaneous solution methods
have become less attractive. This is due to matrix ill-conditioning, especially for the so-called
index-two and higher-index problems (see, e.g., Ref. 27 and Brenan, Campbell
and Petzold [46] for the definition of index for constraint characterization), divergence of the
solution away from the constraint conditions and, ultimately, the large size of the
equations that must be handled. As an alternative to the simultaneous solution methods,
a series of computational methods that employ a divide-and-conquer strategy have been
developed; these are termed partitioned solution procedures and are presented in Park [47], Felippa
and Park [48] and Park and Felippa [49]. As an example, partitioned solution procedures allow
one to analyze fluid-structure interaction problems with two separate single-field analysis
packages, namely, the structural dynamics module and the fluid dynamics analyzer. At
each time integration step, one may advance the solution of the structural equations of motion
by treating the fluid coupling term as an external force. Once the structural coordinates
are advanced, the fluid state variables can be advanced by treating the structural coupling
terms as a source term. A naive partitioned procedure, however, can suffer from a loss of
accuracy as well as computational stability. Thus, a combination of equation augmentation
and stabilization should be devised to recover the accuracy loss and maintain unconditional
stability. Such a solution procedure is in contrast to the practice of embedding both the
structural and fluid dynamics attributes into a combined analysis program.

The numerical solution procedure for MBD systems which we advocate in this chapter 
is termed a staggered MBD solution procedure that solves for the generalized coordinates in a
separate module from that for the constraint forces. This requires a reformulation of the
constraint conditions so that the constraint forces can also be integrated in time. A major 
advantage of such a partitioned solution procedure is that additional analysis capabilities 
such as active controller and design optimization modules can be easily interfaced without 
embedding them into a monolithic program. To this end, the rest of the chapter is organized 
as follows. 

After introducing the basic equations of motion for MBD systems in the next section,
Section III briefly reviews some constraint handling techniques and introduces the
staggered stabilized technique [34,36] for the solution of the constraint forces as independent
variables.

The numerical direct time integration of the equations of motion is described in Section IV.
As accurate damping treatment is important for the dynamics of space structures,
we have employed the central difference method and the mid-point form of the trapezoidal
rule since they engender no numerical damping. This is in contrast to the current practice
in dynamic simulations of ground vehicles, which employs a set of backward difference
formulas [46]. First, the equations of motion are partitioned according to the translational and
the rotational coordinates. This sets the stage for an efficient treatment of the rotational
motions via the singularity-free Euler parameters. The resulting partitioned equations of
motion are then integrated via a two-stage explicit stabilized algorithm for updating both
the translational coordinates and angular velocities [34]. Once the angular velocities are obtained,
the angular orientations are updated via the mid-point implicit formula employing
the Euler parameters.
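
The remark about numerical damping can be checked on the simplest possible model. The sketch below is an illustration, not the chapter's algorithm: it integrates an undamped linear oscillator with the trapezoidal rule and, for contrast, with backward Euler, taken here as the first-order member of the backward difference family. The test problem, the step size and the function names are assumptions chosen for illustration.

```python
"""Energy behaviour of the trapezoidal rule versus backward Euler
on the undamped oscillator u'' = -omega2 * u (illustrative sketch)."""


def trapezoidal_step(u, v, h, omega2):
    """Trapezoidal rule; the implicit equations are solved in closed form."""
    d = 0.5 * h
    u_new = (u * (1.0 - d * d * omega2) + 2.0 * d * v) / (1.0 + d * d * omega2)
    v_new = v - d * omega2 * (u + u_new)
    return u_new, v_new


def backward_euler_step(u, v, h, omega2):
    """Backward Euler (first-order backward difference formula)."""
    u_new = (u + h * v) / (1.0 + h * h * omega2)
    v_new = v - h * omega2 * u_new
    return u_new, v_new


def energy(u, v, omega2):
    """Total energy per unit mass."""
    return 0.5 * v * v + 0.5 * omega2 * u * u


def energy_ratio(step, n_steps=1000, h=0.05, omega2=1.0):
    """Final energy divided by initial energy after n_steps steps."""
    u, v = 1.0, 0.0
    e0 = energy(u, v, omega2)
    for _ in range(n_steps):
        u, v = step(u, v, h, omega2)
    return energy(u, v, omega2) / e0


if __name__ == "__main__":
    print("trapezoidal   :", energy_ratio(trapezoidal_step))
    print("backward Euler:", energy_ratio(backward_euler_step))
```

For this linear problem the trapezoidal rule preserves the energy to roundoff, whereas backward Euler multiplies it by 1/(1 + h^2 omega^2) at every step, so that most of the energy is lost over the run. The same contrast motivates the choice of undamped integrators for lightly damped space structures.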

When the two algorithms, namely, the two-stage explicit algorithm for the generalized
coordinates and the implicit staggered procedure for the constraint Lagrange multipliers,
are brought together in a staggered manner, they constitute a staggered explicit-implicit
procedure, which is summarized in Section V. Section VI presents some example problems,
and discussions concerning several salient features of the staggered MBD solution procedure
are offered in Section VII.


II. Governing Equations of Motion 

The Lagrangian equations of motion for mechanical systems that are free from any
constraint can be written, for the generalized coordinate component $u_i$, as

$$ \frac{d}{dt}\frac{\partial L}{\partial \dot{u}_i} - \frac{\partial L}{\partial u_i} = Q_i, \qquad i = 1 \ldots n, \tag{1} $$


where $L$ is the system Lagrangian, $t$ is the time, $(\dot{\ })$ denotes time differentiation and $Q_i$
is the generalized applied force. It is well known that, if there are $m$ constraint conditions
imposed on $\{u_i,\ i = 1 \ldots n\}$, the above equation must be modified as

$$ \frac{d}{dt}\frac{\partial L}{\partial \dot{u}_i} - \frac{\partial L}{\partial u_i} = Q_i + \sum_{k=1}^{m} \lambda_k B_{ki}, \qquad i = 1 \ldots n, \tag{2} $$


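where $\lambda_k$ is the Lagrange multiplier and $B_{ki}$ is the $i$-th gradient component of the $k$-th
constraint equation, viz., for configuration constraints

$$ \Phi_k(\mathbf{u}) = 0, \qquad B_{ki} = \frac{\partial \Phi_k}{\partial u_i}, \qquad k = 1 \ldots m \tag{3} $$

and for motion constraints

$$ \Phi_k(\dot{\mathbf{u}}, \mathbf{u}) = 0, \qquad B_{ki} = \frac{\partial \Phi_k}{\partial \dot{u}_i}, \qquad k = 1 \ldots m. \tag{4} $$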


Therefore, regardless of the nature of constraints one may express the equations of 
motion with constraints in the following form: 


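$$ \begin{bmatrix} \mathbf{M} & \mathbf{B}^T \\ \mathbf{B} & \mathbf{0} \end{bmatrix} \begin{Bmatrix} \ddot{\mathbf{u}} \\ \boldsymbol{\lambda} \end{Bmatrix} = \begin{Bmatrix} \mathbf{Q} \\ \mathbf{c} \end{Bmatrix} \tag{5} $$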


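where $\mathbf{M}$ is a positive-definite matrix and $\mathbf{c}$ depends on the nature of constraints. For
example, for configuration constraints we have

$$ \mathbf{c} = -\frac{\partial}{\partial \mathbf{u}}\left(\frac{\partial \Phi}{\partial \mathbf{u}}\,\dot{\mathbf{u}}\right)\dot{\mathbf{u}} - 2\,\frac{\partial}{\partial t}\left(\frac{\partial \Phi}{\partial \mathbf{u}}\right)\dot{\mathbf{u}} - \frac{\partial^2 \Phi}{\partial t^2} \tag{6} $$

and for motion constraints

$$ \mathbf{c} = -\frac{\partial \Phi}{\partial t}. \tag{7} $$

An implicit time integration formula to solve (5) may be written as

$$ \begin{cases} \dot{\mathbf{u}}^n = \delta\,\ddot{\mathbf{u}}^n + \mathbf{h}_{\dot{u}}^n \\ \mathbf{u}^n = \delta\,\dot{\mathbf{u}}^n + \mathbf{h}_u^n \end{cases} \tag{8} $$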

where $\delta$ is a stepsize that is dependent on the choice of formula, and $\mathbf{h}_{\dot{u}}^n$ and $\mathbf{h}_u^n$ are
formula-dependent historical vectors that consist of past-step solution components 4 ®’ 50 . 
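As an example, the trapezoidal rule has the following $\delta$ and historical vectors

$$ \begin{cases} \delta = h/2 \\ \mathbf{h}_{\dot{u}}^n = \dot{\mathbf{u}}^{n-1} + \delta\,\ddot{\mathbf{u}}^{n-1} \\ \mathbf{h}_u^n = \mathbf{u}^{n-1} + \delta\,\dot{\mathbf{u}}^{n-1} \end{cases} \tag{9} $$

where $h$ is the time-step increment.

Substitution of (8) into (5) yields

$$ \begin{bmatrix} \mathbf{M} & \delta^2\mathbf{B}^T \\ \mathbf{B} & \mathbf{0} \end{bmatrix} \begin{Bmatrix} \mathbf{u}^n \\ \boldsymbol{\lambda}^n \end{Bmatrix} = \begin{Bmatrix} \mathbf{r}_u^n \\ \mathbf{r}_\lambda^n \end{Bmatrix} = \begin{Bmatrix} \mathbf{M}\mathbf{h}^n + \delta^2\mathbf{Q}^n \\ \mathbf{B}\mathbf{h}^n + \delta^2\mathbf{c} \end{Bmatrix} \tag{10} $$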


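In practice, in order to avoid pivoting and to maintain high accuracy, the solution of
the above difference equations is carried out as follows. First, since $\mathbf{M}$ is nonsingular for
properly formulated dynamical problems, one computes

$$ \bar{\mathbf{u}}^n = \mathbf{M}^{-1}\mathbf{r}_u^n, \qquad \mathbf{C} = \mathbf{M}^{-1}\mathbf{B}^T, \qquad \mathbf{A} = \mathbf{B}\mathbf{C} \tag{11} $$

and factors $\mathbf{A}$. Second, one obtains $\boldsymbol{\lambda}^n$ by solving

$$ \delta^2\,\mathbf{A}\,\boldsymbol{\lambda}^n = \mathbf{B}\bar{\mathbf{u}}^n - \mathbf{r}_\lambda^n \tag{12} $$

Finally, $\mathbf{u}^n$ is obtained from

$$ \mathbf{u}^n = \bar{\mathbf{u}}^n - \delta^2\,\mathbf{C}\,\boldsymbol{\lambda}^n \tag{13} $$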
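
The three-step solution of (10) can be sketched in a few lines of Python for the special case of a diagonal mass matrix and a single constraint, in which the matrix A = BC reduces to a scalar and no factorization is needed. This is a minimal illustration under those assumptions; the function name and the numerical data are hypothetical and do not come from the chapter.

```python
"""Three-step solution of the constrained difference equations (10)
for a diagonal M and one constraint row B (illustrative sketch)."""


def solve_constrained_step(m_diag, b_row, r_u, r_lam, delta):
    """Return (u, lam) satisfying
         M u + delta^2 B^T lam = r_u
         B u                   = r_lam
    """
    # Step 1: u_bar = M^{-1} r_u, C = M^{-1} B^T, A = B C (a scalar here).
    u_bar = [r / m for r, m in zip(r_u, m_diag)]
    c = [b / m for b, m in zip(b_row, m_diag)]
    a = sum(b * ci for b, ci in zip(b_row, c))
    # Step 2: solve delta^2 A lam = B u_bar - r_lam for the multiplier.
    b_u_bar = sum(b * ub for b, ub in zip(b_row, u_bar))
    lam = (b_u_bar - r_lam) / (delta ** 2 * a)
    # Step 3: correct the unconstrained predictor, u = u_bar - delta^2 C lam.
    u = [ub - delta ** 2 * ci * lam for ub, ci in zip(u_bar, c)]
    return u, lam


if __name__ == "__main__":
    m_diag = [2.0, 3.0, 5.0]
    b_row = [1.0, -1.0, 2.0]
    u, lam = solve_constrained_step(m_diag, b_row, [1.0, 0.5, -2.0], 0.3, 0.05)
    print(u, lam)
```

Because the multiplier is obtained by dividing by delta^2 A, a small or nearly singular A amplifies any error in the right-hand side, which is the accuracy concern raised in the paragraph that follows.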


It should be noted that the accuracy loss associated with the factoring of an ill-conditioned
matrix $\mathbf{B}\mathbf{M}^{-1}\mathbf{B}^T$ and the subsequent backsubstitutions can severely influence
the solution accuracy of not only the Lagrange multipliers but also the generalized coordinates,
as seen from (12) and (13). This has motivated many numerical analysts to undertake
the development of methods for differential-algebraic systems, as the recent monograph [46]
and references therein attest to their rich numerical properties. It is generally agreed that
differential-algebraic methods in their present state yield robust solutions for problems of
index one, but can suffer from inaccurate solutions of the Lagrange multipliers for higher-index
problems. Observe that many practical multibody dynamics problems are characterized
by an index greater than one. Hence, the need to compute the constraint
forces accurately remains a challenge. For instance, lock-up mechanisms that are activated when
truss structures are fully deployed in space often introduce stiff responses with a nearly
singular $\mathbf{B}\mathbf{M}^{-1}\mathbf{B}^T$. It is for these problems that more robust constraint
computation algorithms are called for.

One way to improve the accuracy of constraint force computations is to adopt index
reduction strategies as discussed in Ref. 46. However, index reduction inevitably introduces
additional system degrees of freedom in the resulting differential-algebraic equations,
thus destroying the matrix sparsity of (5) in addition to increasing the size of the matrix $\mathbf{B}$.
In what follows we present an alternative approach based on a parabolic regularization of
the equations for the Lagrange multipliers, which preserves the first row of (5) and enables
us to solve for $\boldsymbol{\lambda}$ from the parabolic differential equations.


III. Constraint Handling Techniques 

As alluded to in the Introduction, techniques for handling the system constraints constitute
a major part of solution procedures for the numerical simulation of multibody dynamics
systems. In this section, we will first review the coordinate partitioning technique,
Baumgarte’s technique and the penalty technique. The staggered stabilization procedure
which we advocate will then be described in detail. A distinct feature of the staggered
stabilization procedure is that it can be implemented in a stand-alone module and thus can
be interfaced not only with the equation solver for rigid-body systems but with that for
flexible-body systems as well.


A. Coordinate Partitioning Technique 

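In the coordinate partitioning [28,33] or singular decomposition technique [20,30], one selects
a rank-sufficient part of $\mathbf{B}$ and partitions it as

$$ \mathbf{B} = [\,\mathbf{B}_i \;\; \mathbf{B}_e\,], \qquad \mathbf{u} = \lfloor\, \mathbf{u}_i \;\; \mathbf{u}_e \,\rfloor^T \tag{14} $$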


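where the rank of $\mathbf{B}_i$ ($m \times m$) is $m$ and the subscripts $(i, e)$ refer to internal and external
variables, respectively. First, we express $\mathbf{u}_i$ in terms of $\mathbf{u}_e$ as

$$ \mathbf{u}_i^n = \mathbf{B}_i^{-1}(\mathbf{r}_\lambda^n - \mathbf{B}_e\mathbf{u}_e^n) \tag{15} $$

Since we have

$$ \lfloor\, -\mathbf{B}_e^T\mathbf{B}_i^{-T} \;\; \mathbf{I} \,\rfloor\, \mathbf{B}^T = \mathbf{0} \tag{16} $$

the first row of (10) reduces to

$$ (\mathbf{M}_e + \mathbf{T}^T\mathbf{M}_i\mathbf{T})\,\mathbf{u}_e^n = \mathbf{r}_e^n \tag{17} $$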


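where

$$ \mathbf{T} = \mathbf{B}_i^{-1}\mathbf{B}_e, \qquad \mathbf{M} = \begin{bmatrix} \mathbf{M}_i & \mathbf{0} \\ \mathbf{0} & \mathbf{M}_e \end{bmatrix} \tag{18} $$

and

$$ \mathbf{r}_e^n = \mathbf{r}_{u_e}^n - \mathbf{T}^T\mathbf{r}_{u_i}^n + \mathbf{T}^T\mathbf{M}_i\mathbf{B}_i^{-1}\mathbf{r}_\lambda^n \tag{19} $$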

Once one obtains $\mathbf{u}_e^n$, one can obtain $\mathbf{u}_i^n$ from (15) and similarly $\boldsymbol{\lambda}^n$ from (12). Note that
even though (17) has a smaller dimension than that of (10a), its left-hand side matrix is in
general full since $\mathbf{T}$ given by (18a) is in general full. Hence, unless $\mathbf{T}$ is a constant matrix,
one must refactor the solution matrix in (17) whenever a new $\mathbf{T}$ is formed.
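
The elimination in (15) through (19) can be followed on the smallest case: one internal coordinate, one external coordinate and one constraint, so that every block is a scalar. The sketch below is only an illustration under that assumption, with hypothetical names and data. Here the multiplier is recovered from the internal row of (10), which is one of several equivalent choices.

```python
"""Coordinate partitioning for one internal and one external coordinate
(scalar blocks; illustrative sketch)."""


def partitioned_solve(m_i, m_e, b_i, b_e, r_ui, r_ue, r_lam, delta):
    """Return (u_i, u_e, lam) for the two-coordinate, one-constraint system
         m_i u_i + delta^2 b_i lam = r_ui
         m_e u_e + delta^2 b_e lam = r_ue
         b_i u_i + b_e u_e         = r_lam
    """
    t = b_e / b_i                                    # T = B_i^{-1} B_e
    r_e = r_ue - t * r_ui + t * m_i * r_lam / b_i    # reduced right-hand side
    u_e = r_e / (m_e + t * m_i * t)                  # reduced equation for u_e
    u_i = (r_lam - b_e * u_e) / b_i                  # dependent coordinate
    lam = (r_ui - m_i * u_i) / (delta ** 2 * b_i)    # multiplier from the internal row
    return u_i, u_e, lam


if __name__ == "__main__":
    print(partitioned_solve(2.0, 3.0, 1.0, -0.5, 1.0, 0.4, 0.2, 0.1))
```

With more than one external coordinate the term T^T M_i T fills in the reduced matrix even when M is diagonal, which is the refactoring cost noted above.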


B. Baumgarte’s Technique 

Baumgarte’s technique [22,29] is based on the observation that the errors committed
in computing the constraint conditions (3) or (4) can either be critically damped out or
exponentially decreased as the integration process continues. Mathematically, this can be
stated for the configuration constraint equation (3) as

$$ \ddot{\Phi} + 2\alpha\dot{\Phi} + \beta^2\Phi = 0 \tag{20} $$

or for the motion constraint equation (4) as

$$ \dot{\Phi} + \gamma\Phi = 0 \tag{21} $$

In terms of the general constraint equation augmentation as given by (5b), the preceding
stabilization is equivalent to modifying $\mathbf{c}$ in (5b) accordingly. Hence, the technique
can be implemented within the standard augmented form of the equations of motion (5).
However, if $\mathbf{B}\mathbf{M}^{-1}\mathbf{B}^T$ is ill-conditioned, which can happen since $\mathbf{B}$ is in general
state-dependent, the accuracy of the generalized constraint force, $\boldsymbol{\lambda}$, can be considerably degraded.
This can occur if any two rows of $\mathbf{B}$ are physically similar (i.e., when two members form
a straight line) or numerically close during three-dimensional orientations.

C. Penalty Technique 

In the two constraint handling techniques outlined so far, the objective was to satisfy
the constraint condition

$$ \Phi = 0 \tag{22} $$

whose differentiated forms were augmented to the equations of motion. In the penalty
procedure, one adopts

$$ \boldsymbol{\lambda} = \frac{1}{\epsilon}\,\Phi, \qquad \epsilon \to 0 \tag{23} $$

as the basic constraint equations instead of the twice-differentiated form adopted in (5).

It is noted that the penalty formulation tacitly assumes that there will be violations
of the constraint condition in actual computations, as discussed in Lanczos [51]. If one substitutes
(23) into the governing equations of motion, the resulting equation becomes

$$ \mathbf{M}\ddot{\mathbf{u}} + \frac{1}{\epsilon}\,\mathbf{B}^T\Phi = \mathbf{Q} \tag{24} $$

A major drawback of the above penalty procedure is that, once an error is committed
in computing $\boldsymbol{\lambda}$, there is no compensation scheme by which the drifting of the numerical
solution can be corrected. This has led to the development of a staggered stabilized
procedure as described below.

D. Staggered Stabilization Procedure 

To illustrate this procedure we will consider the case of nonholonomic constraints. 
Instead of substituting the penalty expression directly into the governing equations of 
motion, first we differentiate (23) once to obtain 

\[ \dot{\lambda} = \frac{1}{\epsilon}\left( B\ddot{u} + \frac{\partial\Phi}{\partial t} \right) \qquad (25) \]

where we assume the penalty parameter, $\epsilon$, to be constant. 

Second, we solve (5a) for $\ddot{u}$ in the form 

\[ \ddot{u} = M^{-1}(Q - B^{T}\lambda) \qquad (26) \]

and substitute it into (25) to yield 

\[ \epsilon\dot{\lambda} + BM^{-1}B^{T}\lambda = r_{\lambda}, \qquad r_{\lambda} = BM^{-1}Q + \frac{\partial\Phi}{\partial t} \qquad (27) \]

Notice that the homogeneous part of the above stabilized equation in terms of the 
generalized constraint forces, $\lambda$, has the following companion eigenvalue problem: 

\[ (\gamma + BM^{-1}B^{T}/\epsilon)\,y = 0 \qquad (28) \]

where $\{\gamma_k,\ k = 1 \ldots m\}$ are the eigenvalues of the homogeneous operator for the new 
stabilized constraint equations (27). Since $\gamma_k$ also dictates how the errors in the constraint 
forces will diminish with time, the errors committed in the constraint conditions will decay 
with their corresponding different response time constants. This physically oriented 
stabilization property of the present technique is in contrast to that of Baumgarte’s technique 
wherein all the error components diminish according to a single time constant. 
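
To make the role of the eigenvalues concrete, suppose (purely as an illustrative assumption) that $BM^{-1}B^{T}$ is diagonal with positive entries $a_k$. The eigenvalue problem (28) and the homogeneous part of (27) then give

```latex
\gamma_k = -\frac{a_k}{\epsilon},
\qquad
\lambda_k^{\mathrm{hom}}(t) = \lambda_k^{\mathrm{hom}}(0)\,e^{-a_k t/\epsilon},
\qquad k = 1,\ldots,m
```

so each error component decays with its own time constant $\epsilon / a_k$, set by the system matrices rather than by a user-chosen parameter.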

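Third, this technique enables one to solve for $\lambda$ from the stabilized differential 
equation (27). Specifically, one now has two coupled equations, one set for the generalized 
coordinates u and the other for the generalized constraint forces $\lambda$, which are recalled here 
from (5a) and (27) for the case of nonholonomic constraints: 

\[ \begin{bmatrix} M & 0 \\ 0 & \epsilon \end{bmatrix} \begin{Bmatrix} \ddot{u} \\ \dot{\lambda} \end{Bmatrix} + \begin{bmatrix} 0 & B^{T} \\ 0 & BM^{-1}B^{T} \end{bmatrix} \begin{Bmatrix} u \\ \lambda \end{Bmatrix} = \begin{Bmatrix} Q \\ r_{\lambda} \end{Bmatrix} \qquad (29) \]

Note that the above coupled equations directly provide the desired differential equations 
for the pair $\lfloor u \ \ \lambda \rfloor$. 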

For holonomic constraints, one has several stabilization possibilities. The one we have 
chosen is to integrate the governing equations of motion once to obtain 

\[ \dot{u}^{n} = \delta M^{-1}(Q^{n} - B^{T}\lambda^{n}) + h_{u}^{n} \qquad (30) \]

which is substituted into 

\[ \dot{\lambda} = \frac{1}{\epsilon}\left( B\dot{u} + \frac{\partial\Phi}{\partial t} \right) \qquad (31) \]

to yield: 

\[ \epsilon\dot{\lambda}^{n} + \delta BM^{-1}B^{T}\lambda^{n} = B(\delta M^{-1}Q^{n} + h_{u}^{n}) + \frac{\partial\Phi}{\partial t} \qquad (32) \]

It is observed that, even if $BM^{-1}B^{T}$ is almost singular, this stabilization technique 
as derived in (27) and (32) would not cause numerical difficulty in computing 
$\lambda$, since the solution iteration matrix becomes $(\epsilon + \delta BM^{-1}B^{T})$ for nonholonomic cases 
and $(\epsilon + \delta^{2} BM^{-1}B^{T})$ for holonomic cases. Note that $\epsilon$ must be chosen so that the 
solution remains robust when $BM^{-1}B^{T}$ becomes ill-conditioned, by choosing 
$\epsilon \approx c \,/\, |(BM^{-1}B^{T})^{-1}| \cdot |BM^{-1}B^{T}|$, where c is the solution accuracy desired for $\lambda$. 

Integration of the above equation by the mid-point implicit rule yields the following 
difference equation: 

\[ \left\{ \begin{aligned} (\epsilon I + \delta BM^{-1}B^{T})\,\lambda^{n+1/4} &= \frac{\delta}{2}\,(r_{\lambda}^{n+1/2} + r_{\lambda}^{n}) + \epsilon\lambda^{n} \\ \lambda^{n+1/2} &= 2\lambda^{n+1/4} - \lambda^{n} \end{aligned} \right. \qquad (33) \]

It has been shown that the staggered stabilized procedure for the solution of the 
constraints not only offers a modular software package to treat the constraints but also 
yields more robust solutions than the technique proposed by 
Baumgarte, as reported in Park and Chiou. In particular, even when $BM^{-1}B^{T}$ becomes 
nearly singular, the staggered stabilized procedure (33) gives stable and acceptable 
solutions whereas the constraint forces computed by Baumgarte’s technique diverge. 
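
The robustness claim can be illustrated on a deliberately singular case. The following is a minimal sketch, not the report's implementation: it assumes a 2x2 matrix G standing in for $BM^{-1}B^{T}$, and it assumes that the two residual vectors in Eq. (33) are averaged with weight $\delta/2$. All names and numerical values are hypothetical.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def lambda_half_step(G, lam_n, r_n, r_half, eps, delta):
    """Advance the multipliers from step n to n+1/2, cf. Eq. (33).

    G plays the role of B M^{-1} B^T; eps is the penalty parameter and
    delta the integration coefficient (both assumed constant here).
    """
    lhs = [[eps + delta * G[0][0], delta * G[0][1]],
           [delta * G[1][0], eps + delta * G[1][1]]]
    rhs = [0.5 * delta * (r_n[i] + r_half[i]) + eps * lam_n[i]
           for i in range(2)]
    lam_quarter = solve2(lhs, rhs)
    # Extrapolate the quarter-step value to the half step.
    return [2.0 * lam_quarter[i] - lam_n[i] for i in range(2)]


# G is exactly singular (two identical constraint rows), yet the
# regularized matrix (eps*I + delta*G) is still invertible.
G = [[1.0, 1.0], [1.0, 1.0]]
lam = lambda_half_step(G, [0.0, 0.0], [1.0, 1.0], [1.0, 1.0],
                       eps=1.0e-4, delta=1.0e-2)
```

With G singular, a direct solve with G alone would fail, whereas the regularized system returns finite, symmetric multipliers.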


IV. Solution Algorithms for Generalized Coordinates 

In addition to the choice of implicit and explicit formulas, the recognition that the 
equations of motion for multibody systems with constraints are not ordinary differential 
equations (ODEs) (see, e.g., Petzold 27 ) has placed a unique requirement in the selection 
of solution algorithms for multibody dynamics problems. From the user’s viewpoint, one 
has the option of either employing one of the available ODE packages (see Enright for 
existing ODE packages) or building a special solution module. It should be noted that, 
since the integration of angular velocity vector does not lead to angular orientations, one 
must solve a set of kinematical equations to obtain the desired angular orientations. 

In this section we describe an explicit-implicit transient analysis algorithm that 
exploits the special kinematical relationships between the generalized rotational coordinates, 
namely the Euler parameters, and the angular velocity. The integration of the translational 
coordinates and the angular velocity is accomplished by the central difference formula. It 
should be mentioned that the use of the central difference formula does impose a stepsize 
restriction due to its stability limit ($\omega_{max} h < 2$), where $\omega_{max}$ is the highest angular 
velocity of the system components for rigid-body systems or the highest frequency of the entire 
flexible members for flexible-body systems. Nevertheless, the simplicity of its programming and the 
robustness of its solution results can often be compelling enough to adopt an explicit 
formula, which is the view taken here. 

In conventional structural dynamics analysis, explicit time integration of the equations 
of motion by the central difference formula involves the following two updates per step: 

\[ \left\{ \begin{aligned} \dot{u}^{n+1/2} &= \dot{u}^{n-1/2} + h\,\ddot{u}^{n} \\ u^{n+1} &= u^{n} + h\,\dot{u}^{n+1/2} \end{aligned} \right. \qquad (34) \]
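
The two updates in (34) and the stability limit $\omega_{max} h < 2$ can be illustrated on a single undamped oscillator. This is a sketch under stated assumptions: the test equation $\ddot{u} = -\omega^{2}u$, the start-up half step and the function name are chosen for the example only.

```python
def central_difference(omega, h, steps, u0=1.0):
    """Integrate u'' = -omega**2 * u with the staggered central
    difference updates of Eq. (34); return the largest |u| reached."""
    u = u0
    # Start-up value of the half-step velocity: v^{-1/2} = v^0 - (h/2) a^0
    # with v^0 = 0 and a^0 = -omega**2 * u0.
    v_half = 0.5 * h * omega**2 * u0
    peak = abs(u)
    for _ in range(steps):
        a = -omega**2 * u      # acceleration at step n
        v_half += h * a        # velocity at step n+1/2
        u += h * v_half        # displacement at step n+1
        peak = max(peak, abs(u))
    return peak


stable_peak = central_difference(1.0, 0.1, 2000)    # omega*h = 0.1 < 2
unstable_peak = central_difference(1.0, 2.1, 200)   # omega*h = 2.1 > 2
```

Below the limit the amplitude stays bounded; just above it the solution grows geometrically.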


Unfortunately, this simple procedure is not directly applicable to the rotational part of 
the equations of motion, as $\omega$ is not directly integrable except for some special kinematic 
configurations. This motivates us to partition $\dot{u}$ into the translational velocity vector, $\dot{d}$, 
which is directly integrable, and the angular velocity vector, $\omega$, which is not, and to treat 
them differently, viz.: 

\[ \dot{u} = \begin{Bmatrix} \dot{d} \\ \omega \end{Bmatrix}, \qquad \ddot{u} = \begin{Bmatrix} \ddot{d} \\ \dot{\omega} \end{Bmatrix} \qquad (35) \]

The equations of motion (5a) can be partitioned according to the above partitioning: 

\[ \begin{bmatrix} M_{d} & 0 \\ 0 & M_{\omega} \end{bmatrix} \begin{Bmatrix} \ddot{d} \\ \dot{\omega} \end{Bmatrix} = \begin{Bmatrix} Q_{d} \\ Q_{\omega} \end{Bmatrix} \qquad (36) \]

where 

\[ \begin{Bmatrix} Q_{d} \\ Q_{\omega} \end{Bmatrix} = \begin{Bmatrix} f_{d} - D_{d}(\dot{d}) - S_{d}(d, q) - B_{d}^{T}\lambda \\ f_{\omega} - D_{\omega}(\omega) - S_{\omega}(d, q) - B_{\omega}^{T}\lambda \end{Bmatrix} \qquad (37) \]

in which the subscripts $(d, \omega)$ refer to the translational and the rotational motions, 
respectively, f is the external force vector, D is the generalized damping force including the 
centrifugal force, S is the internal force vector including member flexibility, q is the vector of 
angular orientation parameters, and $B_{d}$ and $B_{\omega}$ are the partitions of the combined gradient matrices 
of the constraint conditions (3) or (4) that are symbolically expressed as 

\[ B = B_{N} + B_{H}, \qquad \lambda = \lambda_{N} + \lambda_{H} \qquad (38) \]

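To effect the body-by-body integration for the rotational degrees of freedom, we 
partition $\dot{\omega}$ further into 

\[ \dot{\omega} = \lfloor \dot{\omega}^{(1)} \ \ \dot{\omega}^{(2)} \ \cdots \ \dot{\omega}^{(j)} \ \cdots \rfloor^{T} \qquad (39) \]

where $\dot{\omega}^{(j)}$ is a (3x1) angular acceleration vector for the j-th body, 

\[ \dot{\omega}^{(j)} = \lfloor \dot{\omega}_{1}^{(j)} \ \ \dot{\omega}_{2}^{(j)} \ \ \dot{\omega}_{3}^{(j)} \rfloor^{T} \qquad (40) \]

We now present the update algorithm for both translational and rotational coordinates. 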


A. Update of Translational and Angular Velocity 

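First, assume that $d^{n+1/2}$ and $q^{n+1/2}$ are already computed so that we can compute 
$\ddot{d}^{n+1/2}$ and $\dot{\omega}^{n+1/2}$ by (36), namely, 

\[ \begin{Bmatrix} \ddot{d}^{n+1/2} \\ \dot{\omega}^{n+1/2} \end{Bmatrix} = M^{-1} \begin{Bmatrix} f_{d}^{n+1/2} - D_{d}^{n+1/2} - S_{d}^{n+1/2} - B_{d}^{T}\lambda^{n+1/2} \\ f_{\omega}^{n+1/2} - D_{\omega}^{n+1/2} - S_{\omega}^{n+1/2} - B_{\omega}^{T}\lambda^{n+1/2} \end{Bmatrix} \qquad (41) \]

Second, we update the translational velocity and the angular velocity vectors at the step 
(n+1) by 

\[ \left\{ \begin{aligned} \dot{d}^{n+1} &= \dot{d}^{n} + h\,\ddot{d}^{n+1/2} \\ \omega^{n+1} &= \omega^{n} + h\,\dot{\omega}^{n+1/2} \end{aligned} \right. \qquad (42) \]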

Third, we update the translational displacement, d, by 

\[ d^{n+3/2} = d^{n+1/2} + h\,\dot{d}^{n+1} \qquad (43) \]

However, the updating of the angular orientation requires somewhat more involved 
computations. To this end, we will employ the Euler parameters and update them accordingly. 

B. Update of Euler Parameters and Angular Velocity 

As mentioned in conjunction with a direct use of (34) for integrating the rotational 
equations of motion, it is necessary to introduce a set of generalized coordinates whose 
time rate can be related to the angular velocity. To this end, we employ the four-parameter 
Euler representation of the angular velocity for each body as (see, e.g., Wittenburg 10 ): 

\[ \dot{q} = \frac{1}{2} \begin{bmatrix} 0 & -\omega^{T} \\ \omega & -\tilde{\omega} \end{bmatrix} q = A(\omega)\,q, \qquad q = \lfloor q_{0} \ \ q_{1} \ \ q_{2} \ \ q_{3} \rfloor^{T} \qquad (44) \]

that is subject to the constraint: 

\[ q^{T} q = 1 \qquad (45) \]

where 

\[ \tilde{\omega} = \begin{bmatrix} 0 & -\omega_{3} & \omega_{2} \\ \omega_{3} & 0 & -\omega_{1} \\ -\omega_{2} & \omega_{1} & 0 \end{bmatrix}, \qquad \omega = \lfloor \omega_{1} \ \ \omega_{2} \ \ \omega_{3} \rfloor^{T} \qquad (46) \]

and the nodal-designation superscript is omitted for notational simplicity. 

We adopt the mid-point implicit procedure to integrate the Euler parameters: 

\[ \left\{ \begin{aligned} \dot{q}^{n+1} &= A(\omega^{n+1})\,q^{n+1} \\ q^{n+1} &= q^{n+1/2} + \frac{h}{2}\,\dot{q}^{n+1} \\ q^{n+3/2} &= 2q^{n+1} - q^{n+1/2} \\ (q^{n+3/2})^{T} q^{n+3/2} &= 1 \end{aligned} \right. \qquad (47) \]

It should be noted that the mid-point implicit update is no more costly than any explicit 
update, as the inverse of the solution matrix can be obtained explicitly. 
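
The closed-form inverse can be sketched as follows. This is an illustrative reading, not the report's code: it uses the fact that $A(\omega)^{2} = -\tfrac{1}{4}|\omega|^{2} I$, so that $(I - \tfrac{h}{2}A)^{-1} = (I + \tfrac{h}{2}A)/\Delta$ with $\Delta = 1 + \tfrac{h^{2}}{16}|\omega|^{2}$. The function names and the assumption of a constant angular velocity over the step are hypothetical.

```python
import math


def a_times_q(w, q):
    """Return A(w) q with A(w) = 0.5 * [[0, -w^T], [w, -w~]], cf. Eq. (44)."""
    w1, w2, w3 = w
    q0, q1, q2, q3 = q
    return [0.5 * (-w1 * q1 - w2 * q2 - w3 * q3),
            0.5 * (w1 * q0 + w3 * q2 - w2 * q3),
            0.5 * (w2 * q0 - w3 * q1 + w1 * q3),
            0.5 * (w3 * q0 + w2 * q1 - w1 * q2)]


def midpoint_update(q, w, h):
    """Advance the Euler parameters over one step h with the mid-point
    implicit rule, using the explicit inverse instead of a linear solve."""
    delta = 1.0 + (h * h / 16.0) * (w[0] ** 2 + w[1] ** 2 + w[2] ** 2)
    aq = a_times_q(w, q)
    # Mid-step value: q_mid = (I + (h/2) A) q / delta.
    q_mid = [(q[i] + 0.5 * h * aq[i]) / delta for i in range(4)]
    # End-of-step value by linear extrapolation through the mid-point.
    return [2.0 * q_mid[i] - q[i] for i in range(4)]


q = [1.0, 0.0, 0.0, 0.0]
for _ in range(1000):
    q = midpoint_update(q, (0.3, -1.2, 2.0), 0.05)
norm = math.sqrt(sum(c * c for c in q))
```

In this sketch the update is a Cayley transform of a skew-symmetric matrix, so the unit-norm constraint (45) is preserved to rounding error without explicit renormalization.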


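Finally, once $q^{n+3/2}$ is computed from (47), it is often required to compute the 
body-fixed basis vector, $b = \lfloor b_{1} \ \ b_{2} \ \ b_{3} \rfloor^{T}$, in terms of the inertial basis vectors, 
$e = \lfloor e_{1} \ \ e_{2} \ \ e_{3} \rfloor^{T}$. These two vectors are related by 

\[ b = R\,e \qquad (48) \]

where 

\[ R = \begin{bmatrix} 2(q_{0}^{2} + q_{1}^{2}) - 1 & 2(q_{1}q_{2} + q_{0}q_{3}) & 2(q_{1}q_{3} - q_{0}q_{2}) \\ 2(q_{1}q_{2} - q_{0}q_{3}) & 2(q_{0}^{2} + q_{2}^{2}) - 1 & 2(q_{2}q_{3} + q_{0}q_{1}) \\ 2(q_{1}q_{3} + q_{0}q_{2}) & 2(q_{2}q_{3} - q_{0}q_{1}) & 2(q_{0}^{2} + q_{3}^{2}) - 1 \end{bmatrix} \qquad (49) \]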
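
For reference, the transformation matrix of (49) can be evaluated directly from the four Euler parameters. The helper below is a small sketch with a hypothetical function name; it assumes the input already satisfies the unit-norm constraint (45).

```python
import math


def rotation_from_euler_parameters(q):
    """Return the 3x3 matrix R of Eq. (49) as nested lists, so that the
    body-fixed basis is b = R e for unit-norm Euler parameters q."""
    q0, q1, q2, q3 = q
    return [
        [2.0 * (q0 * q0 + q1 * q1) - 1.0,
         2.0 * (q1 * q2 + q0 * q3),
         2.0 * (q1 * q3 - q0 * q2)],
        [2.0 * (q1 * q2 - q0 * q3),
         2.0 * (q0 * q0 + q2 * q2) - 1.0,
         2.0 * (q2 * q3 + q0 * q1)],
        [2.0 * (q1 * q3 + q0 * q2),
         2.0 * (q2 * q3 - q0 * q1),
         2.0 * (q0 * q0 + q3 * q3) - 1.0],
    ]


# A 90-degree rotation about the third axis: b1 coincides with e2.
s = math.sqrt(0.5)
R = rotation_from_euler_parameters([s, 0.0, 0.0, s])
```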
C. Update of $\dot{d}$, $\omega$, d, q at the (n+2)-th step 


So far we have advanced from the step (n+1) to the step (n+3/2). In other words, we have 
advanced only half of the total step. For the next step, viz., the step (n+2) from (n+3/2), 
we employ the following sequence of computations: 

\[ \begin{Bmatrix} \ddot{d}^{n+1} \\ \dot{\omega}^{n+1} \end{Bmatrix} = M^{-1} \begin{Bmatrix} f_{d}^{n+1} - D_{d}^{n+1} - S_{d}^{n+1} - B_{d}^{T}\lambda^{n+1} \\ f_{\omega}^{n+1} - D_{\omega}^{n+1} - S_{\omega}^{n+1} - B_{\omega}^{T}\lambda^{n+1} \end{Bmatrix} \qquad (50) \]

\[ \left\{ \begin{aligned} \dot{d}^{n+3/2} &= \dot{d}^{n+1/2} + h\,\ddot{d}^{n+1} \\ \omega^{n+3/2} &= \omega^{n+1/2} + h\,\dot{\omega}^{n+1} \end{aligned} \right. \qquad (51) \]

\[ \left\{ \begin{aligned} d^{n+2} &= d^{n+1} + h\,\dot{d}^{n+3/2} \\ \dot{q}^{n+3/2} &= A(\omega^{n+3/2})\,q^{n+3/2} \\ q^{n+3/2} &= q^{n+1} + \frac{h}{2}\,\dot{q}^{n+3/2} \\ q^{n+2} &= 2q^{n+3/2} - q^{n+1} \\ (q^{n+2})^{T} q^{n+2} &= 1 \end{aligned} \right. \qquad (52) \]

Note that we do not use $d^{n+3/2}$ and $q^{n+3/2}$ in advancing from the step (n+3/2) to the 
present step (n+2) in computing $d^{n+2}$ and $q^{n+2}$. Instead, we employ $d^{n+1}$ and $q^{n+1}$, hence 
the name two-stage staggered explicit procedure. The net result is that, even though we 
take a full step (h instead of h/2), we only advance half the step at a time. In other words, 
we evaluate the acceleration and the angular acceleration vectors twice for each full step. 


V. Implementation 

We will now outline the implementation aspects of the partitioned MBD solution 
procedure. The procedure is implemented in two separate integration modules: a 
generalized-coordinate integrator (CINT) and a Lagrange multiplier solver (LINT). The 
generalized-coordinate integrator employs a two-stage modified form of the central difference method 
for updating the angular velocity vector and the mid-point implicit rule for updating the 
angular orientations via the Euler parameters. The Lagrange multiplier solver adopts a 
staggered form of the mid-point implicit method. 

A. Generalized-Coordinate Integrator (CINT) 

The module receives $f_{\lambda}^{n} = B^{T}\lambda^{n}$ from LINT and advances the solution of the MBD 
equation (1) from time $t^{n}$ to $t^{n+1}$. At each integration step, CINT performs the following 
computations. 

Given: $p^{n} = (\dot{d}^{n-1/2},\ d^{n},\ \omega^{n-1/2},\ q^{n})$ and $g^{n} = (\omega^{n},\ f_{\lambda}^{n} = B^{T}\lambda^{n})$ 

Compute: $\ddot{d}^{n}$ and $\dot{\omega}^{n}$ by (41) 

Advance: 

\[ \left\{ \begin{aligned} \dot{d}^{n+1/2} &= \dot{d}^{n-1/2} + h\,\ddot{d}^{n} \\ d^{n+1} &= d^{n} + h\,\dot{d}^{n+1/2} \end{aligned} \right. \qquad (53) \]

\[ \left\{ \begin{aligned} \omega^{n+1/2} &= \omega^{n-1/2} + h\,\dot{\omega}^{n} \\ q^{n+1/2} &= \frac{1}{\Delta}\left[ I + \frac{h}{2} A(\omega^{n+1/2}) \right] q^{n}, \qquad \Delta = 1 + \frac{h^{2}}{16}\,(\omega_{1}^{2} + \omega_{2}^{2} + \omega_{3}^{2}) \\ q^{n+1} &= 2q^{n+1/2} - q^{n}, \qquad (q^{n+1})^{T} q^{n+1} = 1 \end{aligned} \right. \qquad (54) \]

Output: $p^{n+1} = (\dot{d}^{n+1/2},\ d^{n+1},\ \omega^{n+1/2},\ q^{n+1})$ 

Module Invoke: Call CINT ($p^{n}$, $g^{n}$, h, $p^{n+1}$) 

where h is the stepsize and $A(\omega)$ is given by 

\[ A(\omega) = \frac{1}{2} \begin{bmatrix} 0 & -\omega_{1} & -\omega_{2} & -\omega_{3} \\ \omega_{1} & 0 & \omega_{3} & -\omega_{2} \\ \omega_{2} & -\omega_{3} & 0 & \omega_{1} \\ \omega_{3} & \omega_{2} & -\omega_{1} & 0 \end{bmatrix} \qquad (55) \]

and $q^{n+1/2}$ is an intermediate vector, and (54c) must be solved to obtain $q^{n+1}$ so as to 
satisfy the linear dependency constraint, $q^{T} q = 1$. 

B. Lagrange Multiplier Solver (LINT) 

This module receives $(d, \dot{d}, \omega, q)$ from CINT and performs the following computations. 

Given: $\ell^{n+1/2} = (d^{n+1/2},\ \dot{d}^{n+1/2},\ \omega^{n+1/2},\ q^{n+1/2},\ \lambda^{n})$ 

Compute: $B^{n+1/2}$, $BM^{-1}B^{T}$ and $r_{\lambda}^{n+1/2}$ by (3) and (4) 

Advance: 

\[ \left\{ \begin{aligned} \lambda^{n+1/4} &= (\epsilon I + \delta BM^{-1}B^{T})^{-1} \left( \epsilon\lambda^{n} + \frac{\delta}{2}\,(r_{\lambda}^{n} + r_{\lambda}^{n+1/2}) \right) \\ \lambda^{n+1/2} &= 2\lambda^{n+1/4} - \lambda^{n} \\ f_{\lambda}^{n+1/2} &= (B^{n+1/2})^{T} \cdot \lambda^{n+1/2} \end{aligned} \right. \qquad (56) \]

Output: $\lambda^{n+1/2}$, $f_{\lambda}^{n+1/2}$ 

Module Invoke: Call LINT ($\ell^{n+1/2}$, h, $\lambda^{n+1/2}$, $f_{\lambda}^{n+1/2}$) 

C. Two-Stage Explicit-Implicit Staggered Procedure 

In order to evaluate $\dot{\omega}^{n+1}$, $\omega^{n+1}$ must be known. Notice from the preceding section that 
only $\omega^{n+1/2}$ is available. Because inaccurate treatments of the gyroscopic damping and the 
centrifugal force terms can lead quickly to computational instability in computing $\dot{\omega}^{n+1}$, 
it is not advisable to obtain $\omega^{n+1}$ by extrapolating with $\omega^{n+1/2}$ and $\omega^{n-1/2}$. To mitigate 
this difficulty, we advance only to the next half step at each CINT and LINT call. This 
is illustrated as follows: 

$t = t^{n}$ 

Call CINT ($p^{n}$, $g^{n}$, h, $p^{n+1}$) 

Call LINT ($\ell^{n+1/2}$, h, $\lambda^{n+1/2}$, $f_{\lambda}^{n+1/2}$) 

$t = t^{n} + h/2$ ($n \leftarrow n + 1/2$) 

Call CINT ($p^{n+1/2}$, $g^{n+1/2}$, h, $p^{n+3/2}$) 

Call LINT ($\ell^{n+1}$, h, $\lambda^{n+1}$, $f_{\lambda}^{n+1}$) 

$t = t^{n} + h$ 

Note that 

\[ g^{n+1/2} = (\omega^{n+1/2},\ f_{\lambda}^{n+1/2}) \]

together with 

\[ p^{n+1/2} = (\dot{d}^{n},\ d^{n+1/2},\ \omega^{n},\ q^{n+1/2}) \]

provides the necessary input data to compute $\ddot{d}^{n+1/2}$ and $\dot{\omega}^{n+1/2}$ in the second call of 
CINT in the above calling sequence. In summary, the present procedure requires two 
function evaluations and two $\lambda$-solutions per full step, hence the name “two-stage 
explicit-implicit staggered procedure”. 
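
The ordering of the calls over one full step can be sketched as a small driver. This is a schematic only: the two callables stand in for the CINT and LINT modules, and their one-argument signatures (the time level being processed or produced) are hypothetical simplifications of the argument lists given above.

```python
def staggered_step(cint, lint, t, h):
    """Perform one full step of size h as two half advances.

    cint(t_level) stands for the explicit generalized-coordinate update
    started from time level t_level; lint(t_level) stands for the implicit
    multiplier solve that produces the multipliers at t_level.
    Returns the results in calling order.
    """
    trace = []
    for half in (0, 1):
        level = t + half * h / 2.0
        trace.append(cint(level))             # explicit update of d, w, q
        trace.append(lint(level + h / 2.0))   # implicit update of lambda
    return trace


# Record the calling order with stand-in modules.
calls = staggered_step(lambda s: ("CINT", s), lambda s: ("LINT", s), 0.0, 0.2)
```

Each full step therefore costs two acceleration evaluations and two multiplier solves, as stated in the summary above.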

VI. Numerical Examples 

The two modules, the generalized coordinate integrator (CINT) and the Lagrange multi- 
pliers solver (LINT), have been implemented in Fortran 77. In solving the following three 
example problems, we have incorporated the constraint conditions through the use of La- 
grange multipliers instead of eliminating the constraints. It is therefore necessary to solve 
the governing equations of motion in a way that satisfies the constraint equations. Hence, 
efficient and accurate solutions of these problems will confirm not only the viability of the 
present integration procedure for the solution of the multibody equations of motion with 
or without constraints but also the constraint stabilization procedure in their combined 
totality. 

A. Plane Three-Link Manipulator 

The first problem tested is a simplified version of the seven-link manipulator deployment 
problem 52 . The three links are initially folded and, for modeling simplicity, between the 
two joints is a coil spring which resists a constant deploying force at the tip of the third 
link. Also, the left-hand end of the first link is fixed through the same coil spring to the 
wall. These three coil springs are to be locked up once the links are deployed straight. The 
deployment sequence of the manipulator is illustrated in Fig. 1. The time-discretized 
difference equations both for Baumgarte’s technique and the staggered stabilization technique 
have been solved at each time increment by a Newton-type iterative procedure to meet 
a specified accuracy level. Hence, the performance of the two techniques can be assessed 
by the average number of iterations taken per time increment. This is presented in Fig. 2 
for the accuracy of $10^{-4}$. Notice that the staggered stabilization technique requires on 
the average about 4.5 iterations per step, whereas Baumgarte’s technique requires about 
22 iterations per step. 

Note that Baumgarte’s technique fails to converge for time t ~ 1.1, as manifested in Fig. 2, 
because the rows in B become numerically dependent upon one another when the links are 
in a straight configuration. This corroborates the theoretical prediction of non-convergence 
whenever the solution matrix, $BM^{-1}B^{T}$, for Baumgarte’s technique (see Eqs. (5b), (20) 
and (21)) becomes singular. On the other hand, the staggered stabilization technique still 
converges within 30 iterations, because it overcomes this singularity difficulty, since $\lambda$ still 
exists, as can be seen from Eqs. (27) and (32). 

It should be noted that, in order to avoid such ill-conditioning, one must differen- 
tiate the constraint equations once or twice more and recast the resulting higher-order 
constraint equations in terms of first-order equations with increased number of equations. 
This process is known as an index reduction strategy 46 . Thus, one must restructure the 
augmented equations of motion (5) with the net result of increased solution variables. 
Other techniques involve singular value decompositions, e.g., as advocated by Fuhrer and 
Leimkuhler 53 . On the other hand, the present staggered stabilization technique overcomes 
the ill-conditioning difficulty without restructuring the governing equations of motion. In- 
stead, the constraint equations are enforced in a separate module by the parabolically 
regularized equations for the Lagrange multipliers as derived in (27) and (32). 

Although not reported here, the same relative performance has been observed for different 
accuracy levels, i.e., for the accuracy of $10^{-5}$ and $10^{-6}$. 

From this test problem, we conclude that the staggered stabilization technique yields 
both improved accuracy over and greater computational robustness than the Baumgarte 
technique. In addition, the staggered stabilization technique offers software modularity in 
that the solution of the constraint force, $\lambda$, can be carried out separately from that of the 
generalized displacement, q. The only data each solution module needs to exchange with 
the other is a set of vectors, plus a common module to generate the gradient matrix of the 
constraints, B. However, one should be cautioned not to extrapolate blindly to complex 
problems the results of the present simple examples. Further judicious experiments are 
needed in applying the present staggered stabilization technique to complex production- 
level problems before it can be adopted for general applications in multibody dynamic 
simulations. 


B. Three-Dimensional Double Pendulum 

The second problem with which we have tested the present procedure is a spatially moving 
double pendulum as shown in Fig. 3. The governing equations of motion become those of 
two separate rigid bars, except they are connected by two spherical joints. From Fig. 3 
we have the following quantities: 






x z‘ 


= 0 , 


i = 1, 2. 


( 57 ) 


\[ M = \mathrm{diag}\{ m^{1},\ J^{1},\ m^{2},\ J^{2} \} \qquad (58) \]


B 


I §z*x 0 0 

I — §z*x -I — |z 2 x 


F w = {£}, 

it‘ = {d, d>}\ 


0 1 
0 

fl = ~ u 2 u> 3 (J 2 -J 3 ) ’ * = 1, 2 ' 

U) 3 U}i(J 3 — Ji) 

_ 07i<xJ2(t7l — ^ 2 ) . 

<*' = [i, y, i] , U>' = [ill, i>2, i'll 


A = [A„ 


A2, A3, A., A5, Ag] ' 


( 59 ) 


( 60 ) 


( 61 ) 

( 62 ) 


In the preceding equations, $z^{i}$ is the vectorial distance from the center of the bar to 
the spherical joint constraints, m and J are the three translational and rotatory inertia 
matrices, $\tilde{z}$ is the skew-symmetric matrix formed by the three components of z, $\times$ implies 
a vector cross multiplication, and the superscript i designates the i-th bar. 

The pendulum is originally positioned in a gravity field with initial horizontal angular 
velocities ($\omega_{1}^{(1)} = \omega_{1}^{(2)} = 1$). Figure 4 shows the spatial trajectories of the two mass centers 
as projected on the horizontal surface and on the vertical plane. It is noted that the two 
trajectories form a similar pattern. The constraint forces and angular velocities, although 
not reported herein, exhibit patterns that are analogous in their characteristics for the two 
joints and two mass centers, respectively. 

We have performed convergence studies by using different stepsizes h. Numerical 
evaluations indicate, as with the rigid-link problem, that when the stepsize gives more than 
20 samples per period, the present procedure yields both good accuracy and stability. 
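
The sampling guideline can be turned into a quick stepsize check. The helper names below are hypothetical, and the default of 20 samples per period simply restates the observation above; the comparison with the central difference limit $\omega_{max} h < 2$ is included for orientation.

```python
import math


def suggested_stepsize(omega_max, samples_per_period=20):
    """Stepsize that yields `samples_per_period` samples over the shortest
    period 2*pi/omega_max (omega_max in rad per unit time)."""
    return 2.0 * math.pi / (samples_per_period * omega_max)


def within_stability_limit(omega_max, h):
    """Central difference stability condition omega_max * h < 2."""
    return omega_max * h < 2.0


h = suggested_stepsize(100.0)          # about 3.14e-3 for 100 rad/s
ok = within_stability_limit(100.0, h)  # omega_max * h is about 0.314
```

Twenty samples per period gives $\omega_{max} h = 2\pi/20 \approx 0.31$, well inside the stability limit, so under this guideline accuracy rather than stability governs the stepsize.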

C. Open-Loop Torque for Three-Link Manipulator 

The third problem is a three-link manipulator maneuvering under a specified nonholonomic 
tip velocity constraint. For this problem, both rigid links and flexible links with four 
beam elements per link have been investigated. The flexible beam was modeled with a 
constant-strain Timoshenko beam element that allows large rotations. The three joints are 
modeled as spherical ones and the Lagrange multipliers have been introduced to enforce 
the joint constraints as well as the nonholonomic constraint at the manipulator tip. The 
trajectories of the manipulator and the tip velocity specification are shown in Figs. 5 
and 6. The corresponding joint torques for the rigid and flexible links are also shown in 
Figs. 7 and 8, respectively. Note that even though there exists little difference in the two 
trajectories of the rigid and flexible cases, there are significant differences in the open-loop 
joint torques. These will play an important role in the design of a controller for vibration 
suppression in the manipulator arms. 

VII. Discussions 

In this chapter, we have presented a computational procedure for direct integration of the 
multibody dynamical (MBD) equations with constraints. 

Because of its step-advancing nature, the procedure is labeled as a two-stage staggered 
explicit-implicit algorithm: explicit for solving the generalized coordinates (CINT) and 
implicit for Lagrange multipliers to incorporate constraints (LINT). Our numerical exper- 
iments indicate that it is essential to enforce the linear dependency constraint condition 
on the Euler parameters at each integration step. 

Numerical experiments reported herein and additional applications conducted so far in- 
dicate that the present procedure yields robust solutions if the stepsize gives more than 
twenty samples for the period of the apparent highest response frequency of a given multi- 
body system. Hence, the present procedure appears to have accomplished the following: 

• For closed loop multibody systems and/or problems with complex topology wherein it 
is practically inadvisable to eliminate the constraints, the present procedure facilitates 
a straightforward construction of the governing equations of motion with appropri- 
ate constraints. The generalized coordinates and the system open and closed loop 
Lagrange multipliers can then be solved by the present procedure in a partitioned 
manner. 

• For problems that involve lock-up mechanisms or similar discontinuities, the present 
procedure appears to overcome numerical difficulties encountered in using the Baum- 
garte stabilization. This may be an important impetus for applying the present pro- 
cedure for the simulation of deployment dynamics of space structures. 

• The angular velocity is obtained by an adaptation of the central difference algorithm 
in a two-stage form and the update of angular orientations is based on the Euler 
parameters by adopting the mid-point implicit formula. Both of the integrators conserve 
the system energy, which is important when the multibody simulation package is to be 
interfaced with an active control synthesis module. This is because stability margins 
of active control systems are sensitive to the system damping characteristics, either 
physical or numerical. 

• The present MBD solution procedure is implemented in two separate modules: the 
generalized coordinates solver (CINT) and the constraint Lagrange multiplier solver 
(LINT). Hence, the task of interfacing the present MBD solution modules with 
additional capabilities such as active controller, observer and other analysis and design 
software modules becomes relatively straightforward. Such software architecture is 
in contrast to most of the existing programming practice wherein several analysis 
capabilities are embedded into a single monolithic program. 

Applications of the present procedure to flexible multibody systems are currently being 
carried out and preliminary results are quite encouraging. We hope to report on the results 
of flexible-body dynamics as well as on large-scale multibody problems in the near future. 

Abstract 

A computational procedure suitable for the solution of equations of motion for flexible 
multibody systems has been developed. The flexible beams are modeled using a fully non- 
linear theory which accounts for both finite rotations and large deformations. The present 
formulation incorporates physical measures of conjugate Cauchy stress and covariant strain 
increments. As a consequence, the beam model can easily be interfaced with real-time strain 
measurements and feedback control systems. A distinct feature of the present work is the com- 
putational preservation of total energy for undamped systems; this is obtained via an objective 
strain increment/stress update procedure combined with an energy-conserving time integra- 
tion algorithm which contains an accurate update of angular orientations. The procedure is 
demonstrated via several example problems. 


1. Introduction 

The simulation of flexible multibody systems is becoming an increasingly important 
tool for the design and operation of many engineering applications. Typical examples of 
such systems include deployable space structures, high precision machine dynamics and 
robotics, and other problems containing controlled positioning of structural components. 
The components of these articulated structures typically undergo large relative displace- 
ments and rotations in order to carry out the intended operations. To perform the desired 
kinematic motions, various types of mechanical joints are introduced to constrain the rel- 
ative motion between the various components. New technology needs of both the space 
and robotics industries have increased the demand for accurate numerical simulations of 
the effect of component flexibility on the performance of multibody systems. A significant 
coupling between the gross structural motion and the elastic deformation can be expe- 
rienced by typical applications in which lightweight structures with higher flexibility are 
required to operate with greater positioning accuracy and at higher speeds. To capture 
this phenomenon, a realistic mathematical model of the structural component that can 
readily be incorporated into a general multibody dynamics methodology is necessary. 

Two basic approaches, the floating frame approach and the nonlinear continuum ap- 
proach, exist for the modeling of flexible components within a general multibody system. 





The floating frame approach introduces a moving reference frame to follow some overall 
mean rigid body motion of the beam; the elastic deformation of the beam is then de- 
scribed relative to this moving reference 1-6 . With this approach, the classical multi-rigid 
body analysis was extended to include structural flexibility by superposing existing linear 
deformation descriptions onto the rigid motions of the floating reference frame 7,8 . The 
definition of such a mean axis system and the corresponding deformation modes within 
the general context of the finite element method has been presented 9-11 . To minimize the 
number of elastic coordinates, coordinate transformations from the physical elastic coor- 
dinates to modal coordinates were performed within the multibody dynamics context 12 , 
and static correction modes were used in conjunction with the normal modes of vibration 
to account for reaction forces and torques transmitted to the components through joint 
connections 13,14 . An alternative choice of a floating reference frame for finite element appli- 
cations, termed the convected coordinate system, was introduced as a simple separation of 
the rigid body motion and the structural deformation for a given finite element 15-18 . All of 
these studies, however, are limited by the assumption of linear deformation theory which is 
inadequate to capture certain nonlinear phenomena. Nonlinear deformation theories must 
be taken into account for such instances as the geometric stiffening of a spinning beam 19,20 
in which the structural component experiences a centrifugal force as well as applications 
in which the components necessarily have low mass and very high flexibility. Extensions 
of the original approach to model the nonlinear effects include the substructuring tech- 
nique in which the component is further partitioned into substructures each with a local 
reference frame where normal vibration and static correction modes can then be used to 
model the deformation 21 , and the finite element incorporation of a nonlinear Green strain 
measure 22,23 . The resulting equations of motion of the floating frame approach, written in 
terms of a set of reference coordinates and a set of relative elastic coordinates, inherently 
contain a complex coupling of the gross motion and the elastic deformation modes. 

Recently, a fully nonlinear continuum approach to describe the dynamics of the flexible 
beam has been pursued 24-28 . Through the use of finite-deformation rod theories 29-32 , the 
approach is capable of directly accounting for both finite rotation kinematics and large 
deformations of the beam component. Since the motion due to rigid rotations of the beam 
is not distinguished from that due to deformations, the need for a floating reference frame 
is completely obviated and the component inertia is identical in form to that of a rigid 
body. The inherent nonlinear coupling between the gross body motion and the elastic 
deformation is transferred to the stiffness part of the equations of motion. The key to 
the successful adoption of this approach is to develop a computational procedure for the 
nonlinear internal force term that preserves rigid body motions. 

The aim of this paper is to incorporate the nonlinear continuum formulation of the 
spatial beam motion into a general multibody dynamics software methodology. The present 
formulation employs a convected coordinate representation of physical Cauchy stresses 
and corresponding set of physical strains. This representation naturally lends itself to the 
“software in the real-time experiment” loop as sensors measure only physical quantities. 
Another advantage of the formulation is that the degrees of freedom of the beam component 
embody both the rigid and flexible deformation motions. The task of incorporating the 
multibody system constraints becomes straightforward, and the equations of motion for an 
arbitrary configuration of flexible beams and rigid bodies can automatically be generated 
in terms of an identical set of physical coordinates. Numerical solution procedures for 
the integration of spatial kinematic systems can then be directly applied to these physical 
coordinates. Such a universal treatment is not applicable within the context of the floating 
frame approach as the reference and elastic coordinate definitions are of highly different 
character. 

The rest of the paper will be organized as follows. Section 2 will detail the kinematic 
description of the continuum beam in which the total motion is referred directly to the 
inertial reference frame. The principle of virtual work of a continuum as specialized to 
the spatial motion of the beam component is detailed in Section 3. The subsequent finite 
element discretization of the beam component and overall multibody system equations are 
then presented. Section 4 will summarize the staggered procedure for the integration of 
multibody dynamic systems. The virtual work expression is used to derive the method 
of computation of the internal force, and Section 5 will address this algorithmic treat- 
ment of the nonlinear stiffness operator. Section 6 will present some example problems 
demonstrating the software capabilities. 


2. Beam Kinematics

The present formulation adopts an inertial reference frame for describing the trans- 
lational motions and a body-fixed frame for the rotational motions. The consequence of 
this description is that the translational and rotational variables embody information due 
to both rigid rotations and deformations of the beam. The configuration of the beam, as 
shown in Figure 1, is completely characterized using a position vector locating the neutral 
axis of the beam from the inertial origin and a body-fixed frame representing the orienta- 
tion of the cross-section with respect to the inertial reference frame. The position vector 
r locating an arbitrary particle point on the beam is thus described as 

$$ \mathbf{r} = (X + u)^T \mathbf{e} + \ell^T \mathbf{b} \qquad (2.1) $$

where “boldface” symbols represent three subscripted vectors and the normal type symbols
represent three components of a given vector; $\mathbf{e} = \{ \mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3 \}$ represents the three
orthogonal vectors defining the inertial reference frame; $\mathbf{b} = \{ \mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3 \}$ represents
the body-fixed reference frame which is attached to and rotates with the beam cross-section;
$X = \{ X_1, X_2, X_3 \}^T$ represents the inertial components of the original neutral axis
position; $u = \{ u_1, u_2, u_3 \}^T$ represents the inertial components of the subsequent total
translational displacement of the neutral axis; and $\ell^T = \{ 0, \ell_2, \ell_3 \}$ are the body-fixed
components of the distance from the beam neutral axis to the material point located on
the deformed beam cross-section. It is noted that the beam cross-section is allowed to
rotate such that it is not necessarily perpendicular to the neutral axis in order to model
transverse shear deformations. Warping deformation of the cross-section is not taken into
consideration.


In order to derive the necessary time derivatives for the description of the large rotation
dynamics, we employ the well-known formula [33]:

$$ \frac{d^e}{dt} = \frac{d^b}{dt} + \boldsymbol{\omega} \times \qquad (2.2) $$

where $\boldsymbol{\omega}$ is the angular velocity vector and the superscripts $e$ and $b$ indicate that the
derivatives are to be those observed in the inertial (space) and body (rotating) system of
axes respectively. The above is expressed in matrix form to act on the body frame
components of a given vector

$$ \frac{d^e}{dt} = \frac{d^b}{dt} + \tilde{\omega}^T \qquad (2.3) $$


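and the velocity and acceleration of the position vector (2.1) are

$$ \frac{d\mathbf{r}}{dt} = \frac{du^T}{dt} \mathbf{e} + \ell^T \tilde{\omega}^T \mathbf{b} , \qquad
   \frac{d^2\mathbf{r}}{dt^2} = \frac{d^2 u^T}{dt^2} \mathbf{e} + \ell^T \left( \frac{d^b \tilde{\omega}^T}{dt} + \tilde{\omega}^T \tilde{\omega}^T \right) \mathbf{b} . \qquad (2.4) $$

Given the following relation between the b-basis and the e-basis

$$ \mathbf{b} = R \, \mathbf{e} \qquad (2.5) $$

where $R$ is a (3 x 3) orthogonal transformation matrix, the body frame components of the
skew-symmetric angular velocity tensor ( $\tilde{\omega}^T$ ) are

$$ \tilde{\omega}^T = \frac{dR}{dt} R^T . \qquad (2.6) $$

A conjugate virtual rotation tensor is defined analogous to the above as

$$ \delta\tilde{\alpha}^T = \delta R \, R^T , \qquad (2.7) $$

and the variation of the position vector (2.1) is given as

$$ \delta\mathbf{r} = \delta u^T \mathbf{e} + \ell^T \delta\tilde{\alpha}^T \mathbf{b} . \qquad (2.8) $$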

The equations of motion as derived from the stated beam kinematic description will be 
discussed next. 
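
The kinematic relations (2.1) and (2.4) can be exercised numerically. The following is a minimal Python sketch, not code from the paper. It assumes that the rows of R hold the body axes expressed in the inertial frame (so a vector with body components v has inertial components R^T v), and the helper names `tilde`, `point_position` and `point_velocity` are hypothetical.

```python
def tilde(w):
    """Skew-symmetric matrix such that tilde(w) applied to v equals w x v."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]

def matvec(A, v):
    """Product of a 3x3 matrix (list of rows) with a 3-vector."""
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def point_position(X, u, R, ell):
    """Inertial components of r = (X + u)^T e + ell^T b, with b = R e (2.1)."""
    offset = matvec(transpose(R), ell)
    return [X[i] + u[i] + offset[i] for i in range(3)]

def point_velocity(u_dot, R, omega_b, ell):
    """Inertial components of dr/dt = u_dot^T e + ell^T tilde(omega)^T b (2.4)."""
    # The row vector ell^T tilde(omega)^T has body components omega x ell.
    body = matvec(tilde(omega_b), ell)
    offset = matvec(transpose(R), body)
    return [u_dot[i] + offset[i] for i in range(3)]
```

With R equal to the identity, a cross-section point at body offset (0, 1, 0) spinning about the third axis at unit rate moves with velocity (-1, 0, 0), as expected from the cross product.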


3. Spatial Beam Equations of Motion

The principle of virtual work, which is simply a ‘weak’ or variational form of Cauchy’s
differential equations of motion for the equilibrium of a given set of particles of a continuum,
is stated as [34]

$$ \int_V \delta r_i \, \rho \, \ddot{r}_i \, dV + \int_V \frac{\partial \delta r_i}{\partial x_j} \sigma^e_{ij} \, dV = \int_V \delta r_i f_i \, dV + \int_S \delta r_i t_i \, dS . \qquad (3.1) $$

The cartesian coordinates $x_i$ represent the particle position after some deformation has
taken place, $\delta r_i$ a kinematically admissible virtual displacement, $\ddot{r}_i$ the acceleration, $f_i$
the external force per unit mass, and $t_i$ the stress vector acting on a surface with outward
normal components $n_i$. Likewise, $\sigma^e_{ij}$ represents the cartesian components of the Cauchy
stress tensor, and $\rho$ is the mass density. The expression is tailored for the continuum
beam by using the kinematic relations (2.1), (2.4), and (2.8) for the components $x_i$, $\ddot{r}_i$,
and $\delta r_i$ respectively. As well as providing the basis for finite element approximation
techniques, the variational formulation readily lends itself to the derivation of incremental
strain-displacement relations as deduced from the derivatives of the virtual displacement
components. The present formulation employs a physical stress measure defined as a force
per unit deformed area and the conjugate physical strain increments based on the deformed
coordinates. As such, the formulation can be recast into a convected coordinate
system moving with the beam, thus simplifying the stress and strain computational procedures.
The practical advantages of such a formalism are in real-time software simulation
experiments, as the computed physical quantities correspond to the actual stress/strain
measurements of the sensors located and operating on the deformed structure.

For notational convenience and subsequent finite element discretization, the principle 
of virtual work is expressed in the following operator form: 

$$ \delta F^I + \delta F^S = \delta F^E + \delta F^T \qquad (3.2) $$

where the inertia operator $\delta F^I$, internal force operator $\delta F^S$, external force operator $\delta F^E$,
and traction operator $\delta F^T$ are identified from (3.1). Explicit expressions for the various
operators incorporating the large rotation beam kinematics are derived in Sections 3.1 to
3.3. The finite element discretizations are given in Section 3.4, and the incorporation of
the beam formulation into the multibody dynamics framework is discussed in Section 3.5.


3.1 Spatial Beam Inertia Operator 

The inertia operator was defined from (3.1) as

$$ \delta F^I = \int_V \rho \, \delta r_i \, \ddot{r}_i \, dV = \int_V \rho \, \delta\mathbf{r} \cdot \ddot{\mathbf{r}} \, dV \qquad (3.3) $$

from which an expression can be derived directly from the kinematic equations (2.4) and
(2.8). If the origin of the body-fixed basis is located at the centroid of the cross-section,
the following simple expression results for $\delta F^I$:

$$ \delta F^I = \int_s \{ \delta u^T \;\; \delta\alpha^T \} \begin{Bmatrix} \rho A \, \ddot{u} \\ J \dot{\omega} + \tilde{\omega} J \omega \end{Bmatrix} ds \qquad (3.4) $$

where

$$ \int_A \rho \, \tilde{\ell} \, \tilde{\ell}^T \, dA = J $$

represents the inertia tensor of the beam cross-section and ds represents the remaining 
integration to be performed over a beam length parameter. The translational inertia is 
completely decoupled from the rotary inertia and is of the same form as that seen in rigid 
body dynamics. This is due to the dual choice of the translational displacements measured 
in the inertial basis and the angular velocity measured in the body-fixed basis located at 
the center of mass of the cross-section. 
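
As an illustration of the rotary part of (3.4), the sketch below evaluates the cross-section inertia tensor and the term J dω/dt + ω × Jω. It is a hedged example: the rectangular cross-section, the function names and the argument names are assumptions made here for illustration and do not appear in the paper.

```python
def rect_section_inertia(rho, width, height):
    """J = integral over A of rho * tilde(l) tilde(l)^T dA for l = (0, y, z).

    Assumes a homogeneous rectangle centered on the neutral axis, with
    `width` along b2 (the y direction) and `height` along b3 (the z direction).
    """
    int_y2 = height * width ** 3 / 12.0   # integral of y^2 over the area
    int_z2 = width * height ** 3 / 12.0   # integral of z^2 over the area
    return [[rho * (int_y2 + int_z2), 0.0, 0.0],
            [0.0, rho * int_z2, 0.0],
            [0.0, 0.0, rho * int_y2]]

def rotary_inertia_term(J, omega, omega_dot):
    """Body-frame components of J * omega_dot + omega x (J * omega)."""
    Jw = [sum(J[i][j] * omega[j] for j in range(3)) for i in range(3)]
    Jwd = [sum(J[i][j] * omega_dot[j] for j in range(3)) for i in range(3)]
    gyro = [omega[1] * Jw[2] - omega[2] * Jw[1],
            omega[2] * Jw[0] - omega[0] * Jw[2],
            omega[0] * Jw[1] - omega[1] * Jw[0]]
    return [Jwd[i] + gyro[i] for i in range(3)]
```

For a spin about a principal axis of the cross-section the gyroscopic part vanishes, so the rotary inertia reduces to the rigid-body form J dω/dt.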


3.2 Spatial Beam Internal Force Operator 

The internal force operator was defined in (3.1) as 

$$ \delta F^S = \int_V \frac{\partial \delta r_i}{\partial x_j} \sigma^e_{ij} \, dV \qquad (3.5) $$

identifying as conjugate quantities the virtual displacement gradient and the Cauchy stress
tensor. This form of the internal force along with the beam kinematic description will be
used to deduce a set of virtual strain-displacement relations that are invariant to rigid body 
motions. The corresponding conjugate stress tensor will be obtained from an objective 
incremental procedure that relates incremental strains obtained from the virtual strain 
tensor to Cauchy stress increments. Thus the internal force term will be derived completely 
from the original definition of the beam kinematics without making an a priori definition 
of the existing strains or stresses. 

A physically appealing decomposition of the stress and virtual strain tensors into an
alternative beam reference frame which lies tangent to the deformed neutral axis is introduced
to provide conceptual simplifications in the derivation and subsequent computations.
Certain stress states referenced to this convected frame are kinematically required to vanish
in a beam formulation. When applied to the convected frame stress components, this
choice also reduces the task of stress update computations to a simple additive procedure.
To this end, we introduce a convected reference frame, denoted by $\mathbf{a}$, which is related to
the inertial reference frame $\mathbf{e}$ by

$$ \mathbf{a} = T \, \mathbf{e} , \qquad \mathbf{a} = \{ \mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3 \}^T . \qquad (3.6) $$

For implementation purposes within the context of the finite element method, the convected
frame will be constant on the element level and thus is similar in concept to that
introduced by Belytschko et al. [15,16]. It is noted that this reference frame does not coincide
with the body frame $\mathbf{b}$ attached to the cross-section. The relative difference between these
two reference frames is represented by the rotation matrix $S$ which models the effects of
transverse shear and torsion deformation as

$$ \mathbf{b} = S \, \mathbf{a} , \qquad R = S \, T , \qquad (3.7) $$

and the latter interdependence between the rotation matrices is established.
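
Because T is orthogonal, the relation R = S T of (3.7) gives S = R T^T directly. The short sketch below illustrates this; the function names are hypothetical, and matrices are assumed to be stored as lists of rows.

```python
def mat_mul(A, B):
    """Product of two 3x3 matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def relative_rotation(R, T):
    """S = R T^T, obtained from R = S T with T orthogonal."""
    return mat_mul(R, transpose(T))
```

When the cross-section frame coincides with the convected frame (R = T), the function returns the identity matrix, which corresponds to zero transverse shear and torsion deformation.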


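The internal force operator, originally characterized by the inertial frame components
of the Cauchy stress tensor ( $\sigma^e_{ij}$ ) and conjugate virtual displacement gradient, will equivalently
be expressed in terms of the convected frame components of the stress tensor ( $\sigma^a_{mk}$ )
and a corresponding convected virtual displacement gradient as

$$ \delta F^S = \int_V \frac{\partial \delta r_i}{\partial x_j} \sigma^e_{ij} \, dV = \int_V T_{mi} \frac{\partial \delta r_i}{\partial \xi_k} \sigma^a_{mk} \, dV \qquad (3.8) $$

The symmetric portion of the transformed deformation gradient is used to define the virtual
strain tensor as

$$ \delta\varepsilon^a_{mk} = \frac{1}{2} \left( T_{mi} \frac{\partial \delta r_i}{\partial \xi_k} + T_{ki} \frac{\partial \delta r_i}{\partial \xi_m} \right) \qquad (3.9) $$

which is an objective tensor invariant to arbitrary rigid body motions. The internal force,
written in terms of the convected frame tensors, will be expressed in vector format as

$$ \delta F^S = \int_V \delta\varepsilon^a_{mk} \sigma^a_{mk} \, dV = \int_V \{ \delta\varepsilon_{11} \;\; \delta\gamma_{12} \;\; \delta\gamma_{13} \} \begin{Bmatrix} \sigma_{11} \\ \sigma_{12} \\ \sigma_{13} \end{Bmatrix} dV \qquad (3.10) $$

where the notation

$$ \{ \xi_1, \xi_2, \xi_3 \} , \qquad ( \; )_{\gamma_{ij}} = ( \; )_{ij} + ( \; )_{ji} $$

denotes the coordinates of the convected reference frame and the engineering shear strain
definitions respectively. The rest of the convected frame strain components

$$ \delta\varepsilon_{22} , \quad \delta\varepsilon_{33} , \quad \delta\gamma_{23} $$

are identically equal to zero due to the original assumptions of the beam kinematics.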

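A set of virtual strain-displacement relations can be derived from the expressions (2.8)
and (3.9). The final result is expressed as

$$ \begin{Bmatrix} \delta\varepsilon_{11} \\ \delta\gamma_{12} \\ \delta\gamma_{13} \end{Bmatrix} = \delta\gamma + \tilde{\ell}^T \delta\kappa \qquad (3.11) $$

where

$$ \delta\gamma = T \frac{\partial \delta u}{\partial \xi} + \begin{Bmatrix} 0 \\ -\delta\beta_3 \\ \delta\beta_2 \end{Bmatrix} , \qquad \delta\kappa = \frac{\partial \delta\beta}{\partial \xi} , \qquad \delta\beta = S^T \delta\alpha \qquad (3.12) $$

and is comparable to that of Reissner. In the above expressions, $\delta\gamma$ represents the
membrane and two transverse shear strains, $\delta\kappa$ the torsion and two bending strains, and
$\delta\beta$ the virtual rotations of the cross-section referred to the convected frame.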


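In an analogous manner the total stress state is expressed as

$$ \sigma \equiv \begin{Bmatrix} \sigma_{11} \\ \sigma_{12} \\ \sigma_{13} \end{Bmatrix} = \sigma_\gamma + \tilde{\ell}^T \sigma_\kappa \qquad (3.13) $$

to be obtained from a separate stress update procedure. A substitution of (3.11) and (3.13)
into (3.10), and a spatial integration over a symmetric cross-sectional area results in the
following expression for the internal force

$$ \delta F^S = \int_\xi \left( \delta\gamma^T N_\gamma + \delta\kappa^T M_\kappa \right) d\xi \qquad (3.14) $$

where $N_\gamma$ represent the axial and transverse shear forces per unit length, and $M_\kappa$ represent
the torsional and bending moments per unit length as given by

$$ N_\gamma = \int_A \sigma \, dA , \qquad M_\kappa = \int_A \tilde{\ell} \, \sigma \, dA . \qquad (3.15) $$

To be consistent with the inertia operator derived in (3.4), the above is written as

$$ \delta F^S = \int_\xi \{ \delta u^T \;\; \delta\alpha^T \} \, [ B ]^T \begin{Bmatrix} N_\gamma \\ M_\kappa \end{Bmatrix} d\xi \qquad (3.16) $$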

which involves a transformation back to the body frame components of the virtual rotations 
and also an identification of the desired incremental strain-displacement matrix B. To 
effect the change of the body reference frame of the cross-section orientation in space with 
respect to the constant convected reference frame, we invoke the following relations: 


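$$ \frac{d^a \delta\beta}{d\xi} = S^T \frac{d^a \delta\alpha}{d\xi} = S^T \left( \frac{d^b \delta\alpha}{d\xi} + \tilde{\kappa}_b \, \delta\alpha \right) \qquad (3.17) $$

$$ \tilde{\kappa}_b^T = \frac{d^a S}{d\xi} S^T \qquad (3.18) $$

which are completely analogous to those relating changes in the time derivative given in
(2.3) and (2.6). The strain operator [ B ] of (3.16) is then recognized as

$$ [ B ] = \begin{bmatrix} T \frac{d}{d\xi} & \tilde{i}_1 S^T \\ 0 & S^T \left( \tilde{\kappa}_b + I \frac{d}{d\xi} \right) \end{bmatrix} , \qquad
   \tilde{i}_1 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix} \qquad (3.19) $$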


It remains to provide a procedure for updating $\sigma_\gamma$ and $\sigma_\kappa$ in order to compute $N_\gamma$ and
$M_\kappa$. For this purpose, we employ the following rate-type law that relates the instantaneous
rate of stress to the instantaneous rate of deformation:

$$ \dot{\sigma}^a_{kl} = c^a_{klmp} \, \dot{\varepsilon}^a_{mp} \qquad (3.20) $$

where $c^a_{klmp}$ represents the material response tensor, and $\dot{\sigma}^a_{kl}$ and $\dot{\varepsilon}^a_{mp}$ represent the
convected frame stress and strain rates, respectively. This approximate constitutive law can
be derived by transforming the Truesdell rate equation [35], which is an objective equation
based on inertial components of Cauchy stresses and the velocity gradient tensor, to the
convected basis. This equation is then integrated in time as

$$ \sigma^{a,\,n+1}_{kl} = \sigma^{a,\,n}_{kl} + \int_{t^n}^{t^{n+1}} c^a_{klmp} \, \dot{\varepsilon}^a_{mp} \, dt = \sigma^{a,\,n}_{kl} + c^a_{klmp} \, \Delta\varepsilon^a_{mp} \qquad (3.21) $$

to define the stress update procedure. The evaluation of the strain increments $\Delta\varepsilon^a_{mp}$, to
be defined from the virtual strains (3.12), will be detailed in Section 5.


3.3 Spatial Beam External Force and Traction Operator 

The external force operator defined in (3.1) as

$$ \delta F^E = \int_V \delta r_i f_i \, dV $$

has the final resultant form

$$ \delta F^E = \int_\xi \{ \delta u^T \;\; \delta\alpha^T \} \begin{Bmatrix} f^e \\ f^b \end{Bmatrix} d\xi \qquad (3.22) $$

where $f^e$ represents the inertial components of a force per unit length acting on the beam
neutral axis and $f^b$ represents the body-fixed components of a moment per unit length
acting on the beam cross-section. The traction operator defined as

$$ \delta F^T = \int_S \delta r_i t_i \, dS \qquad (3.23) $$

acts on the exterior surfaces of the beam as natural boundary conditions.


3.4 Finite Element Discretization 

The variational form of the partial differential equations representing the spatial dynamics
of a continuous beam presented in the preceding sections provides a basis for the
finite element method to be used as a spatial discretization procedure [36]. In the present
study, we restrict ourselves to the use of linear shape functions to approximate the displacement
field along the beam, viz.,

$$ u = \sum_{I=1}^{npe} N_I \, u_I \qquad (3.24) $$


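where $N_I$ denotes the spatial linear shape functions, $u_I$ represents the degrees of freedom
at the element nodes, and npe denotes the number of nodes per element. The element
inertia operator, from (3.4), is written as

$$ \delta F^I = \sum_{I=1}^{npe} \sum_{K=1}^{npe} \left( \delta u_I^T \, \rho A \, M^e_{IK} \frac{d^2 u_K}{dt^2} + \delta\alpha_I^T \, J \, M^e_{IK} \frac{d\omega_K}{dt} \right) + \sum_{I=1}^{npe} \delta\alpha_I^T D^e(\omega)_I \qquad (3.25) $$

where

$$ M^e_{IK} = \int_\xi N_I^T N_K \, d\xi , \qquad D^e(\omega)_I = \int_\xi ( \tilde{\omega} J \omega )_I \, d\xi $$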


represent the element mass matrix and nonlinear angular acceleration vector. The former 
will be evaluated as a standard lumped mass matrix for the computational efficiency of 
explicit integration techniques to be described in Section 4, and the latter will be evalu- 
ated from an average of the element nodal angular velocities. The element internal force 
operator, from (3.16), is written as 


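$$ \delta F^S = \sum_{I=1}^{npe} \{ \delta u_I^T \;\; \delta\alpha_I^T \} \int_\xi [ B^E_I ]^T \begin{Bmatrix} N_\gamma \\ M_\kappa \end{Bmatrix} d\xi = \sum_{I=1}^{npe} \{ \delta u_I^T \;\; \delta\alpha_I^T \} \begin{Bmatrix} S^e \\ S^b \end{Bmatrix}^E_I \qquad (3.26) $$

where the evaluation of the element strain operator

$$ [ B^E_I ] = \begin{bmatrix} T \frac{dN_I}{d\xi} & \tilde{i}_1 N_I S_I^T \\ 0 & S_I^T \left( \tilde{\kappa}_b N_I + I \frac{dN_I}{d\xi} \right) \end{bmatrix} \qquad (3.27) $$

and the resultant element stresses $N_\gamma$ and $M_\kappa$, as defined in (3.19) and (3.15) respectively,
will be presented in detail in Section 5. The element external force operator, from (3.22),
is written as

$$ \delta F^E = \sum_{I=1}^{npe} \{ \delta u_I^T \;\; \delta\alpha_I^T \} \int_\xi N_I \begin{Bmatrix} f^e \\ f^b \end{Bmatrix} d\xi \qquad (3.28) $$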


and the traction operator is implemented as boundary conditions on the nodes. The
equations of motion in terms of nodal degrees of freedom ( $\delta u_d$, $\delta\alpha_d$ ) for the entire beam are
obtained from an assembly of the above element operators. For the unconstrained beam,
these nodal virtual displacements and rotations are arbitrary independent variations, and
the discrete equations of motion are written as

$$ \begin{bmatrix} M_d & 0 \\ 0 & J_d \end{bmatrix} \begin{Bmatrix} \ddot{u}_d \\ \dot{\omega}_d \end{Bmatrix} + \begin{Bmatrix} 0 \\ D_d(\omega) \end{Bmatrix} + \begin{Bmatrix} S^e_d \\ S^b_d \end{Bmatrix} = \begin{Bmatrix} f^e_d \\ f^b_d \end{Bmatrix} \qquad (3.29) $$

where $M_d$, $J_d$ represent the assembled mass and inertia matrices, and $D_d(\omega)$, $S_d$, $f_d$
represent the assembled nonlinear acceleration, internal force, and external force vectors
respectively.


3.5 Extension to Multibody Dynamics 


The present formulation of spatial beam dynamics as given by (3.29) can readily be 
incorporated into a general multibody dynamics methodology. The degrees of freedom of a 
rigid body, namely the inertially-based translational position of the center of mass and the 
rotational orientation of the body reference frame, coincide with the degrees of freedom 
of the nodal coordinates of the present beam components. Thus the equations of motion 
(3.29) can be specialized to represent a rigid body system by setting the internal force $S_d$
equal to zero.


It remains to augment both the holonomic and nonholonomic constraint conditions 
modeling the contacts among the various bodies to the equations of motion. For this pur- 
pose, the Lagrange multiplier technique is used to couple the algebraic constraint equations 
with the differential equations of motion of the generalized coordinates by augmenting the 
virtual work of the unconstrained system (3.2) with the virtual work required to enforce 
the constraints. Given a set of equations representing holonomic constraint conditions 
between the displacement coordinates as 

$$ \Phi_H ( u ) = 0 , \qquad \delta\Phi_H = \frac{\partial \Phi_H}{\partial u} \delta u = B_H \, \delta u = 0 \qquad (3.30) $$

and a set representing nonholonomic constraint conditions between the virtual displacements
and rotations as

$$ \Phi_N ( u, \delta u, R, \delta\alpha ) = B_N \begin{Bmatrix} \delta u \\ \delta\alpha \end{Bmatrix} = 0 , \qquad (3.31) $$

the virtual work expression (3.2) of the unconstrained system is modified to account for
the constraints via Lagrange’s multipliers $\lambda$ as [37]

$$ \delta F^I + \delta F^S + \delta\Phi_H^T \lambda_H + \Phi_N^T \lambda_N = \delta F^E + \delta F^T $$

The virtual displacements and rotations of the generalized coordinates can now be treated
as arbitrary independent variations in the modified virtual work expression. The equations
of motion for constrained flexible multibody systems with respect to a set of generalized
coordinates ( $u$, $\omega$ ) denoting both the nodal coordinates of the flexible members and the
physical coordinates of the rigid bodies can be expressed as

$$ \begin{bmatrix} M & 0 \\ 0 & J \end{bmatrix} \begin{Bmatrix} \ddot{u} \\ \dot{\omega} \end{Bmatrix} + B^T \lambda = \begin{Bmatrix} Q_u \\ Q_\omega \end{Bmatrix} \qquad (3.32) $$

where

$$ \begin{Bmatrix} Q_u \\ Q_\omega \end{Bmatrix} = \begin{Bmatrix} f^e - S^e(u, q) \\ f^b - D(\omega) - S^b(u, q) \end{Bmatrix} , \qquad B = \begin{bmatrix} B_H \\ B_N \end{bmatrix} $$

in which $D(\omega)$ represents the nonlinear acceleration, $S$ the internal force vector, $f$ the
external force vector, and $B^T \lambda$ the constraint force vector. As an additional unknown
Lagrange multiplier $\lambda$ has been introduced for each constraint condition, along
with the generalized coordinates for each degree of freedom, the above system of equations
must be augmented with the constraint equations themselves to achieve a determined
system of equations.


4. Time Integration Techniques for Constrained Systems 

The present methodology to formulate the equations of motion of an arbitrary assem- 
blage of interconnected flexible beams and rigid bodies is readily adaptable for use with 
existing multibody dynamics solution techniques. The equations (3.32) model the beam 
components with degrees of freedom $u$ and $\omega$ that embody both the rigid and flexible
deformation motions. As such there is no need to solve separately generalized coordi- 
nates denoting the flexible motion from a reference set of coordinates denoting the rigid 
motion. In addition, as the nodal coordinates of the beam components are defined in 
the same kinematic manner as the physical coordinates of the rigid body components, no 
distinction need be made between the treatment of the flexible and rigid components of 
the multibody system other than the calculation of the internal force of the flexible member.
Therefore, the salient feature of this type of formulation is that numerical solution
procedures for the integration of spatial kinematic systems can be directly applied to the 
generalized coordinates of both the rigid and flexible components. 

A multibody dynamics solution procedure, originally demonstrated on rigid body systems
in previous studies [38-41], is adopted for the above flexible multibody system equations
of motion. The key to the procedure is a staggered implementation of the separate generalized
coordinate integrator and constraint force solver modules. An improved variation of
the explicit central difference algorithm, described in Section 4.1, is used to integrate the 
translational displacements and the angular velocity of the system. An algorithm based on 
the Euler parameter representation of finite rotations, described in Section 4.2, is used to 
update the configuration orientation from the angular velocity. The computations of the 
Lagrange multipliers are then carried out in a separate routine, described in Section 4.3,
which implicitly integrates a stabilized companion differential equation for the constraint 
forces in time. 


4.1 Explicit Generalized Coordinate Integrator 


The central difference explicit integration algorithm is written as

$$ \begin{aligned}
   \dot{d}^{\,n+1/2} &= \dot{d}^{\,n-1/2} + h \, \ddot{d}^{\,n} \\
   d^{\,n+1} &= d^{\,n} + h \, \dot{d}^{\,n+1/2} \\
   \ddot{d}^{\,n+1} &= M^{-1} Q ( d^{\,n+1}, \dot{d}^{\,n+1} )
   \end{aligned} \qquad (4.1) $$

where the superscript $n = 1, 2, 3, \ldots$ designates the discrete time station $t^n = n h$ and
$h$ is the stepsize. Unlike in conventional structural dynamics, a straightforward application
of (4.1) on the rotational equations

$$ J \dot{\omega} + \tilde{\omega} J \omega = f_\omega $$

inherent in the multibody system equations of motion (3.32) leads to computational difficulties.
In order to compute $\dot{\omega}^{n+1}$, it is necessary to have $\omega^{n+1}$. However, due to the
inherent nature of the algorithm, only $\omega^{n+1/2}$ is available. It was shown [41] that the naive
approximation

$$ \omega^{n+1} \approx \omega^{n+1/2} \qquad (4.2) $$

results in a computationally unstable integration of the angular velocity $\omega$. To correct
this within the context of explicit computational sequences, an interlaced application of
the central difference algorithm such that the displacements and velocities are advanced
one-half time step at a time was proposed [40,41]. The algorithm advances the solution to
the time station $t^{n+1/2}$ given the solutions of the two preceding time stations $t^{n-1/2}$ and $t^n$
as follows:
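
(a)  $\dot{u}^{n+1/2} = \dot{u}^{n-1/2} + h \, \ddot{u}^{n}$

(b)  $u^{n+1/2} = u^{n-1/2} + h \, \dot{u}^{n}$

(c)  $\omega^{n+1/2} = \omega^{n-1/2} + h \, \dot{\omega}^{n}$

(d)  $q^{n+1/2} = q ( \omega^{n+1/2} )$

(e)  $S^{n+1/2} = S ( u^{n+1/2}, q^{n+1/2} )$

(f)  $D^{n+1/2} = D ( \omega^{n+1/2} )$ ,  $f^{n+1/2} = f ( t^{n+1/2} )$

(g)  $Q^{n+1/2} = Q ( f^{n+1/2}, S^{n+1/2}, D^{n+1/2} )$

(h)  $\lambda^{n+1/2} = \lambda ( \lambda^{n}, Q^{n+1/2} )$

(i)  $\ddot{u}^{n+1/2} = M^{-1} ( Q_u^{n+1/2} - B_u^T \lambda^{n+1/2} )$

(j)  $\dot{\omega}^{n+1/2} = J^{-1} ( Q_\omega^{n+1/2} - B_\omega^T \lambda^{n+1/2} )$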


The evaluation of the generalized rotational parameters $q$ to be obtained from the angular
velocity, as represented by step (d), will be detailed in Section 4.2. The evaluation of the
internal force $S$ from the current configuration coordinates $u$ and $q$, as represented by
step (e), will be detailed in Section 5. The evaluation of the Lagrange multipliers $\lambda$, as
represented by step (h), will be detailed in Section 4.3. The algorithm proceeds to the next
half time station $t^{n+1}$, now given the solutions at time stations $t^n$ and $t^{n+1/2}$, and thus the
force and acceleration terms are evaluated twice each time step. The algorithm is initiated
for time $t^{1/2}$ given initial conditions for time $t^0$ in the following manner:

(k)  $\dot{u}^{1/2} = \dot{u}^{0} + \frac{h}{2} \, \ddot{u}^{0}$

(l)  $\omega^{1/2} = \omega^{0} + \frac{h}{2} \, \dot{\omega}^{0}$

(m)  $u^{1} = u^{0} + h \, \dot{u}^{1/2}$

(n)  $u^{1/2} = \frac{1}{2} ( u^{0} + u^{1} )$

from which steps (d) through (j) can be performed.

One last remark will be made on the angular velocity integration. The equations of 
motion were derived using body frame angular velocity components. The integration of 
these quantities shown in step (c) is not formally correct as the components at different 
time steps are defined with respect to different body-fixed frames. This concern can be 
eliminated by applying the central difference update to the inertial components of the 
angular velocity. Step (d) will then consist of an appropriate function of inertial angular 
velocity components. The integrated inertial angular velocities must be transformed to the 
moving reference frame before evaluating steps (f) and (j) since the equations of motion
are written with respect to the body frame angular velocity description. The angular 
acceleration evaluated in step (j) must then be transformed back to inertial reference 
frame before being integrated again in step (c). 
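
The interlaced half-step sequence can be sketched for a single unconstrained translational degree of freedom. The example below is a simplified illustration under stated assumptions, not the authors' implementation: it keeps only steps (a), (b), (i) and the starting steps (k), (m), (n), omits the rotational update and the Lagrange multiplier solve, and uses the hypothetical names `integrate_half_steps` and `accel`.

```python
def integrate_half_steps(accel, u0, v0, h, n_steps):
    """Interlaced central difference for a scalar equation u'' = accel(u, u').

    List index k stores the solution at the half station t = k * h / 2.
    Returns displacements and velocities at the full stations t = n * h.
    """
    u = [u0]
    v = [v0]
    a = [accel(u0, v0)]
    # Starting procedure for the first half station.
    v.append(v[0] + 0.5 * h * a[0])        # step (k)
    u_full = u[0] + h * v[1]               # step (m)
    u.append(0.5 * (u[0] + u_full))        # step (n)
    a.append(accel(u[1], v[1]))
    # Each pass advances one half station from the two preceding stations.
    for k in range(1, 2 * n_steps):
        v.append(v[k - 1] + h * a[k])      # step (a)
        u.append(u[k - 1] + h * v[k])      # step (b)
        a.append(accel(u[k + 1], v[k + 1]))  # step (i), unconstrained case
    return u[::2], v[::2]
```

For example, `integrate_half_steps(lambda u, v: -u, 1.0, 0.0, 0.01, 100)` integrates a unit-frequency oscillator to t = 1, and the final displacement and velocity stay close to cos(1) and -sin(1). The force and acceleration are evaluated twice per full step, as noted in the text.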


4.2 Rotational Parameter Integration


The two-stage explicit integrator was applied to the translational displacement and 
velocity coordinates and the angular velocity coordinates. As the rotational orientation 
parameters are not directly integrable from the angular velocity vector, a procedure must 
be developed to update the configuration orientation given the angular velocity. Any finite 
rotation can be uniquely expressed by a rotation angle $\theta$ and an appropriate rotation axis
$n$ [42]. Two rotational parameterizations based on this description are the rotational vector
( $\boldsymbol{\theta}$ ) and the Euler parameters ( $q_0$, $q$ ) defined respectively as

$$ \boldsymbol{\theta} = \theta \, n , \qquad \begin{Bmatrix} q_0 \\ q \end{Bmatrix} = \begin{Bmatrix} \cos ( \theta / 2 ) \\ \sin ( \theta / 2 ) \, n \end{Bmatrix} \qquad (4.3) $$

The three parameters of the rotational vector are independent, while the four Euler parameters
are subject to the constraint

$$ q_0^2 + q^T q = 1 \qquad (4.4) $$

The rotation matrix is represented as a function of the Euler parameters as

$$ R = \begin{bmatrix}
   2 ( q_0^2 + q_1^2 ) - 1 & 2 ( q_1 q_2 + q_0 q_3 ) & 2 ( q_1 q_3 - q_0 q_2 ) \\
   2 ( q_1 q_2 - q_0 q_3 ) & 2 ( q_0^2 + q_2^2 ) - 1 & 2 ( q_2 q_3 + q_0 q_1 ) \\
   2 ( q_1 q_3 + q_0 q_2 ) & 2 ( q_2 q_3 - q_0 q_1 ) & 2 ( q_0^2 + q_3^2 ) - 1
   \end{bmatrix} \qquad (4.5) $$

The body frame components of the angular velocity tensor defined in (2.6) as


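$$ \tilde{\omega}_b^T = \dot{R} R^T = \begin{bmatrix} 0 & \omega_3 & -\omega_2 \\ -\omega_3 & 0 & \omega_1 \\ \omega_2 & -\omega_1 & 0 \end{bmatrix} , \qquad \omega_b = \begin{Bmatrix} \omega_1 \\ \omega_2 \\ \omega_3 \end{Bmatrix} $$

has the Euler parameter representation [42]

$$ \begin{Bmatrix} 0 \\ \omega_b \end{Bmatrix} = 2 \begin{bmatrix} q_0 & q^T \\ -q & q_0 I - \tilde{q} \end{bmatrix} \begin{Bmatrix} \dot{q}_0 \\ \dot{q} \end{Bmatrix} \qquad (4.6) $$

A similar expression for the inertial components of the angular velocity tensor

$$ \tilde{\omega}_e^T = R^T \tilde{\omega}_b^T R = R^T \dot{R} \qquad (4.7) $$

can be derived as

$$ \begin{Bmatrix} 0 \\ \omega_e \end{Bmatrix} = 2 \begin{bmatrix} q_0 & q^T \\ -q & q_0 I + \tilde{q} \end{bmatrix} \begin{Bmatrix} \dot{q}_0 \\ \dot{q} \end{Bmatrix} \qquad (4.8) $$

The above definitions can be inverted to yield the expressions

$$ \begin{Bmatrix} \dot{q}_0 \\ \dot{q} \end{Bmatrix} = \frac{1}{2} \begin{bmatrix} 0 & -\omega_b^T \\ \omega_b & \tilde{\omega}_b^T \end{bmatrix} \begin{Bmatrix} q_0 \\ q \end{Bmatrix} = A_b ( \omega_b ) \begin{Bmatrix} q_0 \\ q \end{Bmatrix} \qquad (4.9) $$

for the body frame components and

$$ \begin{Bmatrix} \dot{q}_0 \\ \dot{q} \end{Bmatrix} = \frac{1}{2} \begin{bmatrix} 0 & -\omega_e^T \\ \omega_e & \tilde{\omega}_e \end{bmatrix} \begin{Bmatrix} q_0 \\ q \end{Bmatrix} = A_e ( \omega_e ) \begin{Bmatrix} q_0 \\ q \end{Bmatrix} \qquad (4.10) $$

for the inertial frame components. A general representation

$$ \dot{q} = A ( \omega ) \, q , \qquad q = \begin{Bmatrix} q_0 \\ q \end{Bmatrix} \qquad (4.11) $$

will be used to denote (4.9) or (4.10) given the angular velocity description. These inverse
expressions are derived from (4.6) and (4.8) by incorporating the derivative of the
constraint equation (4.4)

$$ q_0 \dot{q}_0 + q^T \dot{q} = 0 . \qquad (4.12) $$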


The configuration orientation is obtained from a numerical time discretization of the
above Euler parameter - angular velocity representations. Among several possibilities, the
approximation that satisfies the constraint condition (4.12) in the discrete sense is the
following trapezoidal formula

$$ \frac{1}{h} ( q^{n+1} - q^{n} ) = \frac{1}{2} A ( \omega^{n+1/2} ) ( q^{n+1} + q^{n} ) \qquad (4.13) $$

Due to the structure of $A$, the solution matrix can be analytically inverted such that the
discrete orientation update

$$ q^{n+1} = \frac{1}{D} \left[ I + \frac{h}{2} A ( \omega^{n+1/2} ) \right] \left[ I + \frac{h}{2} A ( \omega^{n+1/2} ) \right] q^{n} \qquad (4.14) $$

where

$$ D = 1 + \frac{h^2}{16} ( \omega_1^2 + \omega_2^2 + \omega_3^2 ) $$

results. The final result is normalized to satisfy the constraint (4.4). The above equation
is valid for either the body or inertial frame decomposition of the angular velocity as long
as the corresponding form of $A$ from (4.9) or (4.10) is used. The resulting update (4.14)
involves only explicit computations and is readily incorporated into the two-stage explicit
integration procedure.
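
The orientation update can be illustrated with a short sketch for body-frame angular velocity components. It assumes the body-frame form of A in which dq0/dt = -(ω·q)/2 and dq/dt = (q0 ω - ω × q)/2, and a denominator D = 1 + h²|ω|²/16, which is what the trapezoidal rule yields for that A. The function names are hypothetical.

```python
import math

def apply_A(omega, q):
    """Return A(omega) q for body-frame omega, with q = [q0, q1, q2, q3]."""
    q0, qv = q[0], q[1:]
    cross = [omega[1] * qv[2] - omega[2] * qv[1],
             omega[2] * qv[0] - omega[0] * qv[2],
             omega[0] * qv[1] - omega[1] * qv[0]]
    dq0 = -0.5 * sum(omega[i] * qv[i] for i in range(3))
    dqv = [0.5 * (omega[i] * q0 - cross[i]) for i in range(3)]
    return [dq0] + dqv

def update_euler_parameters(q, omega, h):
    """Trapezoidal update q_new = (1/D) (I + h/2 A)(I + h/2 A) q, then normalize."""
    def half_factor(x):
        Ax = apply_A(omega, x)
        return [x[i] + 0.5 * h * Ax[i] for i in range(4)]
    D = 1.0 + (h * h / 16.0) * sum(w * w for w in omega)
    q_new = [c / D for c in half_factor(half_factor(q))]
    # The update preserves the unit norm analytically; renormalizing only
    # removes accumulated round-off.
    norm = math.sqrt(sum(c * c for c in q_new))
    return [c / norm for c in q_new]
```

Starting from q = [1, 0, 0, 0] and applying the update repeatedly with a constant spin about the third axis gives q0 and q3 that follow cos(θ/2) and sin(θ/2) closely, with θ the accumulated rotation angle.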


4.3 Constraint Force Solution Procedure 

A partitioned solution procedure has been employed to solve the generalized coor- 
dinates separately from the Lagrange multipliers. To effect a partitioned solution of the 
constraints, a stabilized companion differential equation for the constraint forces is formed 
by adopting the penalty procedure [38,39]. The penalty procedure uses the equations

$$ \lambda_H = \frac{1}{\epsilon} \Phi_H , \qquad \dot{\lambda}_N = \frac{1}{\epsilon} \Phi_N , \qquad \epsilon \rightarrow 0 \qquad (4.15) $$

as the basic constraint equations instead of (3.30) and (3.31) for the holonomic and the
nonholonomic constraint conditions respectively. The penalty equations can be written in
the general form, from (3.30) and (3.31), as

$$ \dot{\lambda} = \frac{1}{\epsilon} B \, \dot{z} \qquad (4.16) $$

The numerical solution to the above companion differential equation is obtained as follows.
The constrained equations of motion (3.32) are integrated once using the
implicit integration rule

$$ \dot{z}^{n+1/2} = \dot{z}^{n} + \delta \, \ddot{z}^{n+1/2} , \qquad \delta = \frac{h}{2} $$

as

$$ \dot{z}^{n+1/2} = \delta M^{-1} ( Q^{n+1/2} - B^T \lambda^{n+1/2} ) + \dot{z}^{n} \qquad (4.17) $$

This expression is substituted into (4.16) to obtain the stabilized differential equation for
the Lagrange multipliers

$$ \epsilon \dot{\lambda}^{n+1/2} + \delta B M^{-1} B^T \lambda^{n+1/2} = \delta B M^{-1} Q^{n+1/2} + B \dot{z}^{n} . \qquad (4.18) $$

The same integration rule is applied to this equation to result in the discrete update

$$ ( \epsilon I + \delta^2 B M^{-1} B^T ) \lambda^{n+1/2} = \epsilon \lambda^{n} + r^{n+1/2} \qquad (4.19) $$

$$ r^{n+1/2} = \delta^2 B M^{-1} Q^{n+1/2} + \delta B \dot{z}^{n} . \qquad (4.20) $$

The same procedure can also be derived with different integration rules. The update of
the Lagrange multipliers, performed for each half time step, is easily adapted into the
two-stage explicit integration procedure.
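
The multiplier update (4.19)-(4.20) can be sketched for the simplest case of a single constraint and a diagonal mass matrix, in which the solution matrix reduces to a scalar. This restriction, and the function and argument names, are assumptions made for this illustration only.

```python
def solve_multiplier(eps, delta, m_diag, B_row, Q, z_dot_n, lam_n):
    """Scalar instance of (eps I + delta^2 B M^-1 B^T) lam = eps lam_n + r.

    m_diag  : diagonal entries of the mass matrix M
    B_row   : the single row of the constraint Jacobian B
    Q       : generalized force vector at the half station
    z_dot_n : generalized velocities at the previous station
    """
    BMinvBt = sum(b * b / m for b, m in zip(B_row, m_diag))
    BMinvQ = sum(b * q / m for b, q, m in zip(B_row, Q, m_diag))
    Bz = sum(b * z for b, z in zip(B_row, z_dot_n))
    r = delta ** 2 * BMinvQ + delta * Bz          # right-hand side term (4.20)
    return (eps * lam_n + r) / (eps + delta ** 2 * BMinvBt)

def constrained_acceleration(m_diag, B_row, Q, lam):
    """Accelerations M^-1 (Q - B^T lam) once the multiplier is known."""
    return [(q - b * lam) / m for q, b, m in zip(Q, B_row, m_diag)]
```

As a usage example, two unit masses tied together by the constraint row B = [1, -1], with a unit force on the first mass, zero initial velocities, δ = 0.5 and a small ε, give a multiplier close to 0.5 and nearly equal accelerations of 0.5 for both masses.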


5. Internal Force Computations 


The algorithmic treatment of the nonlinear stiffness operator is addressed in this
section. The explicit generalized coordinate integrator of the previous section requires an
evaluation of the internal force at a current time step $t^{n+1}$ from the coordinates of the
beam configuration at that time. The internal force is first evaluated on the element level
for all the finite elements comprising the flexible component from (3.26) as

$$ \begin{Bmatrix} S^e \\ S^b \end{Bmatrix}^E_I = \int_\xi [ B^E_I ]^T \begin{Bmatrix} N_\gamma \\ M_\kappa \end{Bmatrix} d\xi \qquad (5.1) $$

after which these individual element computations are assembled to form the internal force
of the discrete beam. The necessary computations to be described are the evaluations of
the discrete strain operator $[ B^E_I ]$ defined in (3.27) and the resultant stresses $N_\gamma$ and
$M_\kappa$ respectively.


The Timoshenko beam formulation in which the translational degrees of freedom are 
independent from the rotational degrees of freedom requires an approximation within the 
element such that these variables will be continuous across the element boundaries. Thus 
a two node finite element representing a linear interpolation of the translational and rota- 
tional variables is a sufficient discretization of the beam. To avoid the locking phenomenon, 
the interpolation of the rotational degrees of freedom associated with the transverse shear 
strain is underintegrated. After incorporating these concepts into (3.27), the resulting
expression for the discrete strain operator is given by

$$ [ B^E ] = \begin{bmatrix}
   -\frac{1}{l} T & \frac{1}{l} T & \frac{1}{2} \tilde{i}_1 S_1^T & \frac{1}{2} \tilde{i}_1 S_2^T \\
   0 & 0 & S_1^T \left( \frac{1}{2} \tilde{\kappa}_b - \frac{1}{l} I \right) & S_2^T \left( \frac{1}{2} \tilde{\kappa}_b + \frac{1}{l} I \right)
   \end{bmatrix} \qquad (5.2) $$

which acts on the virtual displacements and rotations

$$ \{ \delta u_1 \;\; \delta u_2 \;\; \delta\alpha_1 \;\; \delta\alpha_2 \}^T $$

where the subscripts refer to the element node number. The convected frame $T$ matrix,
body frame curvature tensor $\tilde{\kappa}_b$ and element neutral-axis length $l$ are constant quantities
over the element domain, while the relative cross-section deformation $S$ matrices are nodal
quantities. The computation of these terms from the nodal displacement and rotation
coordinates of the current configuration is detailed in Section 5.1.


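A stress update procedure of the form 

$$N_\gamma^{n+1} = N_\gamma^{n} + \Delta N_\gamma , \qquad M_\kappa^{n+1} = M_\kappa^{n} + \Delta M_\kappa \qquad (5.3)$$

is used to derive the resultant stresses of the current configuration at time $t_{n+1}$ from the 
resultant stresses of the past configuration at time $t_n$. The simple additive form of the 
procedure, which was derived from the numerical integration of a rate-type constitutive 
law, is due to the use of a convected frame stress and conjugate strain decomposition. The 
resultant stress increments $\Delta N_\gamma$ and $\Delta M_\kappa$ are obtained via 

$$\Delta N_\gamma = \begin{bmatrix} EA & 0 & 0 \\ 0 & GA & 0 \\ 0 & 0 & GA \end{bmatrix} \Delta\gamma , \qquad \Delta M_\kappa = \begin{bmatrix} GJ & 0 & 0 \\ 0 & EI_2 & 0 \\ 0 & 0 & EI_3 \end{bmatrix} \Delta\kappa \qquad (5.4)$$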

A set of strain increments $\Delta\gamma$ and $\Delta\kappa$, which denote the change from time $t_n$ to $t_{n+1}$, are 
defined as a finite analogy to the infinitesimal virtual strains $\delta\gamma$ and $\delta\kappa$ derived in Section 
2. A specific computational procedure, designed for use with this incremental interpretation 
of the continuum-based formulation such that the computed finite strain increments are 
invariant to arbitrary rigid body motions, is discussed in Section 5.2. 
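
The additive update described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Fortran 77 implementation of the report: the function name `update_resultants`, the argument layout, and the use of two separate bending stiffnesses (`EI2`, `EI3`) are hypothetical, and the strain increments are taken as already computed.

```python
def update_resultants(N_old, M_old, d_gamma, d_kappa, EA, GA, GJ, EI2, EI3):
    """Additive update of the resultant force N and moment M (three components each).

    Assumes diagonal constitutive matrices diag(EA, GA, GA) and
    diag(GJ, EI2, EI3) acting on the strain increments d_gamma and d_kappa.
    """
    force_stiffness = (EA, GA, GA)      # membrane, shear, shear
    moment_stiffness = (GJ, EI2, EI3)   # torsion, bending, bending
    N_new = [n + c * dg for n, c, dg in zip(N_old, force_stiffness, d_gamma)]
    M_new = [m + c * dk for m, c, dk in zip(M_old, moment_stiffness, d_kappa)]
    return N_new, M_new
```

Because the update is purely additive, any rigid-body content left in the strain increments would accumulate in the stresses, which is why the invariance of the increments matters.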


5.1 Computation of the Strain Operator 

The reference frames introduced in the formulation, namely the body frame $b$ attached 
to the cross-section and the convected frame $a$ tangent to the deformed neutral axis, are 
computed as follows. The Euler parameters representing the orientation of the beam 
cross-section at the finite element nodes are output from the generalized coordinate integrator 
at each time step. The rotation matrices $R_i$, representing the $b$ reference frame at each 
element node, are thus computed directly from the Euler parameter representation of a 
rotation matrix (4.5). This matrix contains rotational information due both to 
the rigid motion of the convected reference frame and to the transverse shear and torsional 
deformations of the cross-section relative to the convected frame. 


The neutral axis of the finite element is defined as the straight line connecting the 
two element nodes; its tangent is calculated directly from the 
translational displacements output from the generalized coordinate integrator. Given this 
tangent $a_1$, the $a_2$ vector is defined as the cross product of $a_1$ with the $b_3$ axis of $R_1$, 
and the remaining axis $a_3$ is defined to complete the right-hand coordinate system. The 
computed axes $\{a_1, a_2, a_3\}$, as shown in Figure 2, define the rows of the $T$ matrix. 
The rotation matrices $S_i$, defined at each element node as the relative difference between 
the element convected frame and the nodal body frames, are thus 

$$S_i = R_i\, T^T , \qquad i = 1,2 . \qquad (5.5)$$

The procedure is an approximation applicable for moderate strains such that the $S_i$ matrices 
contain information solely due to transverse shear and torsional deformations [43]. The 
rotation matrices of the discrete strain operator (5.2) have thus been defined. 
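
The construction of the convected frame and of the relative deformation matrices can be sketched as follows. This is a minimal illustration with several assumptions that are not stated in the report: rotation matrices store the frame axes as rows, the cross product is ordered as $b_3 \times a_1$ so that the frame is right-handed and coincides with the body frame for an undeformed element, and the helper names are hypothetical.

```python
import math

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def convected_frame(x1, x2, R1):
    """Rows a1, a2, a3 of T from the nodal positions and the node-1 body frame."""
    a1 = unit([p2 - p1 for p1, p2 in zip(x1, x2)])  # tangent of the straight neutral axis
    a2 = unit(cross(R1[2], a1))                     # assumed ordering: b3 x a1
    a3 = cross(a1, a2)                              # completes the right-handed triad
    return [a1, a2, a3]

def relative_deformation(R_node, T):
    """S_i = R_i T^T, the nodal rotation relative to the convected frame."""
    return matmul(R_node, transpose(T))
```

For an element that has only moved rigidly, the nodal body frames coincide with the convected frame and each $S_i$ reduces to the identity. The construction breaks down if $b_3$ becomes parallel to the element axis, which is outside the moderate-strain range assumed here.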


The body frame components of the curvature tensor $\tilde\kappa$ defined in (3.18) as 

$$\tilde\kappa^T = \begin{bmatrix} 0 & \kappa_3 & -\kappa_2 \\ -\kappa_3 & 0 & \kappa_1 \\ \kappa_2 & -\kappa_1 & 0 \end{bmatrix} , \qquad \kappa = \begin{Bmatrix} \kappa_1 \\ \kappa_2 \\ \kappa_3 \end{Bmatrix}$$

are equivalent to 

$$\tilde\kappa^T = \frac{\partial R}{\partial \ell}\, R^T \qquad (5.6)$$

as the convected frame $T$ matrix is defined to be constant along the element domain where 
the differentiation is performed. This definition is completely analogous to the angular 
velocity tensor defined in (2.6) and motivates the use of an Euler parameter representation 
of the curvature completely analogous to the Euler parameter representation of angular 
velocity (4.6) as a basis for the computation of the element curvature from the nodal 
rotational variables. The Euler parameter-curvature representation is 



subject to the constraints 


9o 

-q 



E ( 9a ) 


dq 


(5.7) 


$$q_0^2 + q^T q = 1 , \qquad \frac{\partial q_0}{\partial \ell}\, q_0 + \frac{\partial q^T}{\partial \ell}\, q = 0 \qquad (5.8)$$


An approximation to be used in (5.7) that satisfies the constraint conditions in the discrete 
sense, 

$$\frac{\partial q}{\partial \ell} \approx \frac{1}{\ell}\,(q_2 - q_1) , \qquad q_a = \frac{\tfrac{1}{2}(q_1 + q_2)}{\left\| \tfrac{1}{2}(q_1 + q_2) \right\|} \qquad (5.9)$$

is evaluated using the Euler parameters of the element nodes output from the generalized 
coordinate integrator. It will be shown that this discrete computation is invariant to rigid 
rotations contained in the total nodal Euler parameters. 


The simple normalized average of the nodal Euler parameters has a physical 
interpretation. The Euler parameters $q_a$ correspond to an average orientation, in a geometric 
sense, of the two nodal cross-section orientations. This is demonstrated by the following 
example characterizing a state of constant curvature of a finite element shown in Figure 3. 
The orientation of the convected element frame is characterized by a rotation of an angle 
$\phi$ about an axis $n_a$ from the inertial reference frame, and the relative nodal cross-section 
orientations are characterized by a rotation from the convected frame of angles $-\tau$ and $\tau$ 
about axis $n_b$ for nodes 1 and 2 respectively. The Euler parameters designating the total 
cross-section orientation of the two nodes due to these combined effects can be expressed 
as 

$$q_1 = \begin{Bmatrix} \cos\frac{\phi}{2}\cos\frac{\tau}{2} + n_a \cdot n_b\, \sin\frac{\phi}{2}\sin\frac{\tau}{2} \\ -\cos\frac{\phi}{2}\sin\frac{\tau}{2}\, n_b + \cos\frac{\tau}{2}\sin\frac{\phi}{2}\, n_a - \sin\frac{\phi}{2}\sin\frac{\tau}{2}\, n_a \times n_b \end{Bmatrix}$$

$$q_2 = \begin{Bmatrix} \cos\frac{\phi}{2}\cos\frac{\tau}{2} - n_a \cdot n_b\, \sin\frac{\phi}{2}\sin\frac{\tau}{2} \\ \cos\frac{\phi}{2}\sin\frac{\tau}{2}\, n_b + \cos\frac{\tau}{2}\sin\frac{\phi}{2}\, n_a + \sin\frac{\phi}{2}\sin\frac{\tau}{2}\, n_a \times n_b \end{Bmatrix} \qquad (5.10)$$

which is obtained by applying the quaternion product rule [44] to the Euler parameter 
definitions 

$$\begin{Bmatrix} \cos\frac{\tau}{2} \\ -\sin\frac{\tau}{2}\, n_b \end{Bmatrix} , \qquad \begin{Bmatrix} \cos\frac{\tau}{2} \\ \sin\frac{\tau}{2}\, n_b \end{Bmatrix} , \qquad q_a = \begin{Bmatrix} \cos\frac{\phi}{2} \\ \sin\frac{\phi}{2}\, n_a \end{Bmatrix}$$

of the relative nodal orientations and the convected orientation respectively. The average 
of the two nodal Euler parameters (5.10) is 

$$\tfrac{1}{2}(q_1 + q_2) = \begin{Bmatrix} \cos\frac{\phi}{2}\cos\frac{\tau}{2} \\ \cos\frac{\tau}{2}\sin\frac{\phi}{2}\, n_a \end{Bmatrix}$$

the norm of which is $\cos\frac{\tau}{2}$. When normalized, the above average is identical to the average 
orientation of the two nodes given by $q_a$. It can be shown that for this example the 
discretization (5.9) when substituted into (5.7) gives the finite element curvature computation 

$$\kappa = \frac{4}{\ell}\, \sin\frac{\tau}{2}\; n_b$$

which approximates the true curvature strain $\frac{2\tau}{\ell}\, n_b$. The computation retains only the 
rotation parameters $\tau$ originally defined relative to the rigid body orientation, and is thus 
invariant to the rigid body motions. For instances when the validity of the approximation 
is challenged, an incremental curvature computation can be made as discussed in the next 
section, from which the total curvature is obtained from an appropriate update procedure. 
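
The invariance claimed in this example can be checked numerically. The sketch below is an illustration only: it assumes the curvature is evaluated as twice the vector part of the quaternion product $\bar q_a \otimes \partial q/\partial\ell$ (one common form of the Euler parameter rate relation, used here as a stand-in for (5.7)), with the finite difference and normalized average of the nodal parameters as described above. All function names are hypothetical.

```python
import math

def quat_mul(p, q):
    """Hamilton product of quaternions stored as (scalar, x, y, z)."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    return (p0 * q0 - p1 * q1 - p2 * q2 - p3 * q3,
            p0 * q1 + q0 * p1 + p2 * q3 - p3 * q2,
            p0 * q2 + q0 * p2 + p3 * q1 - p1 * q3,
            p0 * q3 + q0 * p3 + p1 * q2 - p2 * q1)

def axis_angle(axis, angle):
    """Euler parameters of a rotation by `angle` about `axis`."""
    norm = math.sqrt(sum(c * c for c in axis))
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)

def element_curvature(q1, q2, length):
    """Curvature from nodal Euler parameters (assumed discrete form)."""
    dq = tuple((b - a) / length for a, b in zip(q1, q2))   # finite difference
    avg = tuple(0.5 * (a + b) for a, b in zip(q1, q2))     # simple average
    norm = math.sqrt(sum(c * c for c in avg))
    qa = tuple(c / norm for c in avg)                      # normalized average
    qa_conj = (qa[0], -qa[1], -qa[2], -qa[3])
    prod = quat_mul(qa_conj, dq)
    return tuple(2.0 * c for c in prod[1:])                # vector part only
```

For nodal rotations of $-\tau$ and $\tau$ about $n_b$ superposed on any rigid rotation, the function returns $(4/\ell)\sin(\tau/2)\, n_b$, independent of the rigid rotation.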


5.2 Computation of the Strain Increments 


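The strain increments are defined from the virtual strains (3.12) by replacing the 
variational operator $\delta$ with an incremental operator $\Delta$ as 

$$\Delta\gamma = T\, \frac{\partial \Delta u}{\partial \ell} + \begin{Bmatrix} 0 \\ -\Delta\beta_3 \\ \Delta\beta_2 \end{Bmatrix} , \qquad \Delta\kappa = \frac{\partial \Delta\beta}{\partial \ell}$$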


such that $\Delta u$ and $\Delta\beta$ are finite analogs of the infinitesimal displacements and rotations 
$\delta u$ and $\delta\beta$. For computational purposes, it becomes necessary to decompose the convected 
frame components of the virtual rotations of the cross-section $\delta\beta$ into a rotation due 
to rigid body motion $\delta\varphi$ and that due to deformations $\delta\tau_a$ as 

$$\delta\beta = \delta\tau_a + \delta\varphi . \qquad (5.11)$$

This relation is derived by substituting the following definitions 

$$\delta\tilde\beta^T = S^T \delta\tilde\alpha^T S , \quad \delta\tilde\alpha^T = \delta R\, R^T , \quad \delta\tilde\varphi^T = \delta T\, T^T , \quad \delta\tilde\tau^T = \delta S\, S^T , \quad \delta\tilde\tau_a^T = S^T \delta\tilde\tau^T S$$

into the identity 

$$R = S\, T , \qquad \delta R = \delta S\, T + S\, \delta T .$$

It is noted that $\delta\alpha$, $\delta\varphi$, and $\delta\tau$ represent moving frame or spatial components referred to 
the defining reference frame, whereas $\delta\beta$ and $\delta\tau_a$ represent material components referred 
back to the convected frame. From these definitions, the incremental strain $\Delta\gamma$ is given 
by 

$$\Delta\gamma = T\, \frac{\partial \Delta u}{\partial \ell} + \begin{Bmatrix} 0 \\ -\Delta\varphi_3 \\ \Delta\varphi_2 \end{Bmatrix} + \begin{Bmatrix} 0 \\ -\Delta\tau_{a3} \\ \Delta\tau_{a2} \end{Bmatrix} \qquad (5.12)$$

representing the membrane strain and transverse shear strain increments. Likewise, the 
incremental curvature representing the torsion and bending strains is given by 

$$\Delta\kappa = \frac{\partial \Delta\tau_a}{\partial \ell} \qquad (5.13)$$

as the incremental rotations $\Delta\varphi$ defined from the $T$ matrix are constant over the element 
length. 


Essential for the use of these incremental strains is a proper definition and subsequent 
computation of the finite displacement and finite rotation increments. The incremental 
translations are defined by 

$$\Delta u = u^{n+1} - u^{n} \qquad (5.14)$$

as the displacements are true vector quantities. The incremental rotations are defined as 
follows. Rotations are updated by the product of orthogonal matrices via either [24] 

$$R^{n+1} = R^{(l)} R^{n} = e^{\tilde\theta^T} R^{n} \qquad \text{or} \qquad R^{n+1} = R^{n} R^{(r)} = R^{n} e^{\tilde\Theta^T} \qquad (5.15)$$

using the rotational vectors $\theta$ or $\Theta$ based on the spatial or material reference frames 
respectively. It can be seen from the linearizations of the left and right rotational updates [24] 

$$R^{n+1} \approx R^{n} + \delta R , \qquad \delta R = \tilde\theta^T R^{n} = R^{n}\, \tilde\Theta^T$$

that the virtual rotations 

$$\delta\tilde\varphi^T = \delta T\, T^T , \qquad \delta\tilde\tau_a^T = S^T \delta S$$

correspond to spatial and material rotation updates 

$$T^{n+1} = \Delta T\; T^{n} , \qquad S^{n+1} = S^{n}\, \Delta S \qquad (5.16)$$

respectively. Thus $\Delta\varphi$ and $\Delta\tau_a$ are defined as the rotational vectors parameterizing the 
matrices $\Delta T$ and $\Delta S$ respectively. Two different approximate methods, which extract 
this pseudovector from the given rotation matrix, are used to obtain the incremental 
rotations. The particular approximation methods are chosen such that objective computations 
of the incremental strains (5.12) and (5.13) are achieved. 


To this end, the first two terms of (5.12) 


must be computed such that the $\Delta\varphi$ rotation increment compensates for the rigid rotation 
contained in the displacement increment $\Delta u$ defined in (5.14). To accomplish this, $\Delta\varphi$ is 
computed by 


A'~p T = AT 


n + \ 


— AT 




where AT n+ ! is defined from 
'p n+ ’ — 

The computation was derived from the linear approximation 

T n+1 ~ ( I + A <p T ) T n 


* = exp ( Ia* t ) T n = AT n+ » T r 


(5.18) 


(5.19) 


rewritten as 


A9 T = ( T 


n+l 


_ x r 


) X n+ * 


(5.20) 


and introducing (5.19) to achieve a skew symmetric matrix. In order to preserve rigid 
motions, the matrix $T$ in the first term of (5.17) must be evaluated as $T^{n+1/2}$. This is 
shown as follows from an example of the rigid rotation of an element in which $\Delta\gamma_1 = 0$. 
From (5.20), it is seen that the rotational term of (5.17) becomes 


(5.21) 

The finite element evaluation of the displacement term of (5.17) is given by 


,dA u 

~w 


= T Y ( A u 2 - Aui ) 


(5.22) 


for the two-noded beam element of length $\ell$. For the rigid rotation of the second node 
about the first node, the incremental translational displacements are simply 


Aui = 0 , 


Aiio = 


e , ( < f ” +1 - v ) . 


dAu 

~df 


= t: +l 


- t? 


as the direction cosines of the rotation are contained in the first row of the $T$ matrix. 
Thus for (5.17) to be identically equal to zero, it is necessary to evaluate (5.22) using 
$T^{n+1/2}$. To obtain the true stretch with respect to the neutral-axis reference frame at 
the current configuration, we simply rotate the mid-configuration computation up to the 
current configuration as 

(5.23) 

As in the preceding analysis, the incremental displacements for an arbitrary rotation and 
stretch are given by 


= 0 , A u 2 = ( (£ e + d) ^ n+1 - l t t* ) 

where $d$ represents a stretch relative to the original element length $\ell$. The rotational 
expression (5.21) remains valid, and the bracketed term in (5.23) becomes 


( J7T1 ) * ' •« 


f n +l 


Premultiplication of the above by $\Delta T^{n+1/2}$ results in the final computation 

containing solely a measure of stretch regardless of the magnitude of the rigid rotation. 
The incremental rotations $\Delta\tau_a$ used to compute the remaining terms, 

$$\Delta\gamma_2 = \begin{Bmatrix} 0 \\ -\Delta\tau_{a3} \\ \Delta\tau_{a2} \end{Bmatrix} , \qquad \Delta\kappa = \frac{\partial \Delta\tau_a}{\partial \ell} \qquad (5.24)$$


representing transverse shear and curvature strains respectively, are computed 
independently from $\Delta\varphi$ as follows. The rotation increments $\Delta\tau_a$ are obtained from the matrix $\Delta S$ 
defined in (5.16), denoting the relative orientation between the current deformation matrix 
$S^{n+1}$ and the past deformation matrix $S^{n}$ rigidly rotated to the current convected frame. 
Another method to extract a rotation pseudovector from a given orthogonal rotation 
matrix, given by [43] 

$$\Delta\tilde\tau_{a_i}^T = \frac{2\,(\Delta S_i - \Delta S_i^T)}{1 + \mathrm{tr}\, \Delta S_i} , \qquad i = 1,2 \qquad (5.25)$$

is used to define $\Delta\tau_a$ at each element node. The above method yields a simpler and 
more accurate computation of a rotation vector than (5.18). Whereas (5.18) was necessary 
to compute $\Delta\varphi$ such that the rigid rotations within (5.17) are preserved, (5.25) is used 
within (5.24) as this computation is made from matrices which by construction contain 
information solely due to deformation. Given the nodal rotation increments, the 
locking-free form of the elemental shear strain is obtained from the nodal average as 

$$\Delta\gamma_2 = \frac{1}{2} \left( \begin{Bmatrix} 0 \\ -\Delta\tau_{a3} \\ \Delta\tau_{a2} \end{Bmatrix}_{1} + \begin{Bmatrix} 0 \\ -\Delta\tau_{a3} \\ \Delta\tau_{a2} \end{Bmatrix}_{2} \right)$$

and the elemental curvature is computed from the finite element approximation 

$$\Delta\kappa = \frac{1}{\ell}\, (\Delta\tau_{a_2} - \Delta\tau_{a_1})$$

This completes the computational procedures for the incremental strains. The detailed 
strain computations of (5.12) and (5.13) are used in (5.4) to determine the stress 
increments, from which the current stress state is obtained from the update procedure (5.3). 
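
The pseudovector extraction of (5.25) can be illustrated with a short sketch. It assumes the formula has the form $2(\Delta S - \Delta S^T)/(1 + \mathrm{tr}\,\Delta S)$ and reads the vector off the resulting skew matrix using the standard (non-transposed) skew convention, which may differ in sign from the transposed convention of the report. The function names are hypothetical.

```python
import math

def rotation_z(theta):
    """Standard rotation matrix for a rotation by theta about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def pseudovector(dS):
    """Rotation pseudovector from 2 (dS - dS^T) / (1 + tr dS).

    For a rotation by theta about a unit axis n this returns
    2 tan(theta / 2) n, which is close to theta n for small increments.
    """
    trace = dS[0][0] + dS[1][1] + dS[2][2]
    factor = 2.0 / (1.0 + trace)
    W = [[factor * (dS[i][j] - dS[j][i]) for j in range(3)] for i in range(3)]
    # Axial vector of the skew matrix W (standard convention, an assumption).
    return (W[2][1], W[0][2], W[1][0])
```

Since the result has magnitude $2\tan(\theta/2)$ rather than $\theta$, the extraction is exact in direction and accurate to third order in the increment size, which fits its use on the small deformation increments $\Delta S$.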


6. Numerical Examples 

The computational techniques, namely the staggered multibody dynamics solution 
procedure combining the generalized coordinate integrator and the constraint force solver 
discussed in Section 4 and the finite element computations of the beam internal force 
discussed in Section 5, have been implemented into a Fortran 77 software package. The 
result is a robust method which solves the present formulation of the equations of motion 
of an arbitrary assemblage of flexible beams and rigid bodies. In order to demonstrate the 
current software capabilities, the following examples highlighting the flexible motion of the 
beam component are presented. 

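The first example is included to verify the geometric stiffening phenomena exhibited 
by a rotating beam [6,18,21,28]. The beam is pinned at the left end; the other end remains 
free. The following material and geometric properties were used: 

$EA = 2.8 \times 10^{7}$ lb, $GA = 1.0 \times 10^{7}$ lb, $EI = 1.4 \times 10^{4}$ lb in$^2$ 

$\rho A = 1.2$ lbm/in, $\rho I = 6.0 \times 10^{-4}$ lbm in, $\ell = 10$ in. 

A prescribed angular rotation about the $e_3$ axis of 

$$\psi(t) = \begin{cases} \dfrac{6}{15} \left[ \dfrac{t^2}{2} + \left( \dfrac{15}{2\pi} \right)^2 \left( \cos\dfrac{2\pi t}{15} - 1 \right) \right] \text{rad} & 0 \le t \le 15 \text{ sec} \\[2ex] (6t - 45) \text{ rad} & t > 15 \text{ sec} \end{cases}$$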

is applied at the pinned end. The time history of the tip deflection relative to a refer- 
ence frame coinciding with the prescribed angular position and the time history of several 
configurations of the beam are given in Figure 4. As alluded to in the introduction, an 
overall steady rotation of the beam gives rise to a centrifugal force which is responsible 
for a change in the bending stiffness that cannot be predicted using linear deformation 
theories. After initial increasing tip deflections, the beam begins to stiffen as the angular 
velocity increases due to the centrifugal inertia force. As the angular velocity reaches a 
constant state, the beam then reaches a steady state phase of small vibrations. This ex- 
ample shows the capability of the nonlinear strain formulation to automatically account 
for the geometric stiffening effect. The results are comparable to those presented by Simo 
and Vu-Quoc [28]. To reproduce these results with alternative methods such as the 
substructuring technique [21], a convergence analysis based on the selection of mode shapes must be 
performed. 
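
The prescribed rotation of this example can be evaluated with a short function. The sketch assumes the smooth spin-up profile commonly used for this benchmark, $\psi(t) = \frac{6}{15}\left[\frac{t^2}{2} + \left(\frac{15}{2\pi}\right)^2\left(\cos\frac{2\pi t}{15} - 1\right)\right]$ for $t \le 15$ sec, followed by the linear branch $(6t - 45)$ rad; the first branch is an assumption where the printed formula is hard to read. The function and parameter names are hypothetical.

```python
import math

def prescribed_angle(t, spin_up_time=15.0, final_rate=6.0):
    """Prescribed rotation angle in radians at time t in seconds."""
    if t <= spin_up_time:
        period_term = (spin_up_time / (2.0 * math.pi)) ** 2
        return (final_rate / spin_up_time) * (
            0.5 * t * t
            + period_term * (math.cos(2.0 * math.pi * t / spin_up_time) - 1.0)
        )
    # Constant angular velocity after spin-up: 6 t - 45 for the default values.
    return final_rate * t - 0.5 * final_rate * spin_up_time
```

With the default values both branches give 45 rad at $t = 15$ sec and the angular velocity is continuous there at 6 rad/sec, consistent with the transition to a steady rotation described in the text.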

The next examples exhibit the combined large deformation and large rotation capa- 
bilities of the present formulation. In the first instance, the beam is pinned as above and 
is subjected to given initial velocity impulses exciting various deformation mode shapes 
under planar motion. The following material and geometric properties were used in order 
to witness finite deformations: 

$EA = 4.0 \times 10^{7}$ lb, $GA = 2.0 \times 10^{7}$ lb, $EI = 1.3 \times 10^{7}$ lb in$^2$ 

$\rho A = 0.98$ lbm/in, $\rho I = 3.3 \times 10^{-2}$ lbm in, $\ell = 200$ in. 

The initial velocity profiles with the resulting time histories of several deformed 
configurations are given in Figures 5, 6, and 7. Note the versatility of the formulation 
in its ability to capture the response to a variety of situations in which different 
fundamental modes of the beam are excited. The approach avoids the difficulty of tailoring 
the selection of mode shapes of the flexible components to the problem at hand. 
The repeatability of the deformation shapes through time is due to the invariance of the 
internal force computations to the overall rigid motion. This property of computational 
objectivity is further illustrated in Figure 8, which shows the time history of the strain, 
kinetic, and total energy over four revolutions for the first bending mode example. The 
time integration and internal force algorithms are such that the conservation 
of energy is retained computationally, as seen by the fact that the total energy remains 
constant over all the revolutions. Similar results, not presented here, are obtained for 
the other deformation examples. 

To present the applicability of the flexible beam component within the multibody 
dynamics framework, the final example of a spatial double pendulum is given. The double 
pendulum is modeled with two beams; a spherical joint connects the last node of the first 
beam to the first node of the second beam and also pins the first node of the first beam. 
It is noted that the joint connection can easily be accounted for from a finite element 
assemblage which leaves the rotational degrees of freedom free at the hinge location. The 
method was used to verify the results obtained using the Lagrange multiplier solver on the 
augmented equations described in Section 3.5. In the first case, the pendulum is subjected 
to a gravity field in the vertical z-direction and an initial velocity impulse in the horizontal 
x-y plane such that solely rigid motion is excited. The problem is run for four cases of 
increasing beam flexibility as follows: 


1. $EA = 1.0 \times 10^{4}$ lb, $GA = 0.5 \times 10^{4}$ lb 

2. $EA = 1.0 \times 10^{3}$ lb, $GA = 0.5 \times 10^{3}$ lb 

3. $EA = 2.0 \times 10^{2}$ lb, $GA = 1.0 \times 10^{2}$ lb 

4. $EA = 1.0 \times 10^{2}$ lb, $GA = 0.5 \times 10^{2}$ lb 

with the remaining parameters 

$\rho A = 1$ lbm/in, $\rho I = 0.833 \times 10^{-3}$ lbm in, $\ell = 1$ in 

held constant. The initial velocity impulse and the spatial trajectories of the mass 
center of the second beam as projected on the x-y and x-z planes are given in Figure 9. The 
trajectory of the first case coincides exactly with a rigid body solution to the problem, 
and the slight deviation of the trajectories due to the increasing flexibility can be seen. 
The energy time histories for the problem, given in Figure 10, verify the computational 
objectivity of the algorithm as again energy is identically conserved. Again, the invariance 
of the internal force calculations in the three-dimensional environment is witnessed by the 
negligible strain energy contribution for all of the flexible cases. The time integration of 
the spatial kinematics preserves the balance between the kinetic and potential energies of 
the problem. Next, the flexible double pendulum is given an initial velocity impulse to 
excite deformation motion as well as the rigid motion. For this case the parameters used 
were 

EA = 1.8 x 10 7 8 lb, GA = 0.9 x 10 8 lb, El = 1.4 x 10 8 lb in 2 

$\rho A = 0.98$ lbm/in, $\rho I = 0.67$ lbm in, $\ell = 120$ in. 

The initial velocity profile, the resulting time histories of several deformed configurations 
and energy time history are given in Figure 11, exhibiting the large spatial rotation and 
deformation capabilities of the formulation. The energy conservation is retained for the 
computations of spatial deformations. 

Further examples of large scale multibody systems are in process, and these results 
are to be presented in the near future. 


7. Concluding Remarks 

A flexible beam finite element that is readily incorporated into multibody dynamics 
applications has been presented. The beam formulation is based on fully nonlinear strain 
measures which remain invariant to rigid body motions. The model retains a Cauchy 
stress and physical strain description, and as such it can be easily interfaced with real- 
time slewing control applications as the measured strains can directly be used as a feedback 
signal without requiring sophisticated transformations. In addition, the formulation uses 
an inertial reference for the beam dynamics such that the degrees of freedom of the flexible 
component are defined in the same sense as the rigid components by including without dis- 
tinction both the rigid and flexible deformation motions. The consequence is adaptability 
into multibody dynamics methodologies as numerical solution procedures for the integra- 
tion of spatial kinematic systems can directly be applied to the generalized coordinates of 
both the rigid and flexible components. The success of the approach relies on an accurate 
computation of the nonlinear internal force term. For this reason, a calculation of finite 
strain increments that are invariant to arbitrary rigid motions of the 
beam has been presented. The proposed methodology is suitable for treating the dynamics of flexible beams which 
undergo a variety of structural deformations in addition to large overall motions. The 
same approach can be used in formulating other types of structural components. 

Figure 4. Geometric Stiffening (5 Elements): 

(a) Tip Deflection vs. Time 

(b) Displacement History 

Figure 5. First Bending Mode (8 Elements): 

(a) Initial Beam Position vs. Initial Velocity Profile 

(b) Displacement History 

Figure 6. Second Bending Mode (12 Elements): 

(a) Initial Beam Position vs. Initial Velocity Profile 

(b) Displacement History 

Figure 9. Spatial Double Pendulum (16 Elements): 

(a) Second Beam Trajectory: X-Y Plane 

(b) Second Beam Trajectory: X-Z Plane 

Figure 10. Spatial Double Pendulum: Energy Conservation 

(a) $EA = 1.0 \times 10^{4}$ lb, $GA = 0.5 \times 10^{4}$ lb 

(b) $EA = 1.0 \times 10^{3}$ lb, $GA = 0.5 \times 10^{3}$ lb 

(c) $EA = 2.0 \times 10^{2}$ lb, $GA = 1.0 \times 10^{2}$ lb 


Parallel Simulation of Multibody Systems 

Abstract 


A parallel partitioning scheme based on physical-coordinate variables is presented to sys- 
tematically eliminate system constraint forces and yield the equations of motion of multi- 
body dynamics systems in terms of their independent coordinates. Key features of the 
present scheme include an explicit determination of the independent coordinates, a par- 
allel construction of the null space matrix of the constraint Jacobian matrix, an easy 
incorporation of the previously developed two-stage staggered solution procedure, and a 
Schur complement based parallel preconditioned conjugate gradient numerical algorithm. 

1. Introduction 

In the past decade, several stand-alone general-purpose multibody simulation codes 
[1-11] have been progressively developed for application to multidisciplinary 
engineering problems, to improve either control system design and verification or system 
design and dynamics analysis. As a result, these computer codes have been successfully 
applied to a number of multibody dynamics (MBD) problems such as robot arm maneuvers, 
spacecraft and ground vehicle dynamics. However, when systems become very complex, 
computational efficiency becomes a dominant concern during the preliminary design stage, 
which requires many analysis iterations. This has motivated us to make effective use of 
parallel computational technology in order to speed up the dynamics analysis of MBD 
systems, thus ultimately achieving real-time simulation for large-scale problems. The 
issues in exploiting the parallelism that is inherent in MBD systems include a versatile 
data structure for describing system topology, an automatic procedure to generate system 
equations of motion, a streamlined incorporation or elimination of system constraints, a 
robust time integration algorithm, and an easy interpretation of the simulation results. 

In general, the equations of motion for MBD systems can be generated by employing 
a set of generalized coordinates to define the state of the system [6-8]. Note that, the 
motions of each body in the system can initially be assumed to be independent of one 
another. Kinematic relationships between bodies in the system are then imposed, which 
result in the corresponding constraint conditions. If one augments the constraint equations 
to the governing equations of motion by introducing the Lagrange multipliers, the resulting 
equations of motion are characterized as differential-algebraic equations (DAE). 

Since a closed-form solution of DAE is in general not attainable except for highly 
simplified problems, two different approaches have been developed for the solution of DAE. 
The first approach adopts so-called constraint stabilization methods [12,13,17-19,23], which 
integrate and solve DAE while attempting to satisfy the constraint equations. From a 
computational point of view, this approach utilizes a large number of equations yet preserves the 
sparsity of the solution matrix and the simple expression of the kinematic relationships. The 
second approach eliminates the system dependent coordinates, which is equivalent to 
eliminating the Lagrange multipliers from DAE, so that a set of second order differential equations 
can be obtained. Schemes [7,10,20-22] leading to this approach include the generalized 
coordinate partitioning (GCP) scheme, the singular value decomposition (SVD) scheme 
and the null space (NS) scheme. In contrast to the first approach, the second approach 
enjoys a minimal set of equations of motion but suffers from dense solution matrices and 
highly nonlinear kinematic descriptions. 

Numerical experience indicates that constraint stabilization methods are generally 
preferred for closed kinematic loops whereas constraint elimination methods are better 
suited for open kinematic links. The objective of the paper is to present a parallel constraint 
elimination algorithm by constructing the null space of the constraint Jacobian matrix, and 
employ a parallel preconditioned conjugate gradient numerical algorithm to solve for the 
equations of motion that are given in Schur complement form. 

To address the present natural partitioning scheme, the paper is organized as follows. 
Section 2 presents the equations of motion that have been derived in DAE form. Section 3 
describes the natural partitioning scheme in detail with several example problems. Section 
4 applies a parallel computational algorithm to the second order differential equations. 
Section 5 describes a parallel preconditioned conjugate gradient scheme that is used to 
find the solution for the independent accelerations. Section 6 reports on some preliminary 
results that were obtained using the natural partitioning scheme and the staggered solution 
procedure that has been previously developed [15,16]. 

2. Equations of Motion for Multibody Systems 

The equations of motion for an MBD system can be derived and expressed in various 
forms depending upon the type of coordinates chosen to describe the configuration 
of the bodies in the system. In the present derivation, a spatial position vector with respect 
to an inertial reference frame is described using Cartesian coordinates. A body-fixed 
coordinate frame is then attached to the center of mass of each body. The position of a body is 
then defined from the origin of the inertial reference frame to the origin of the body-fixed 
frame, and the position of a particle on the body is defined from the origin of the body-fixed 
frame to the particle. A velocity vector $u$ contains the translational velocity $\dot r$, which is 
defined in the inertial frame, and the angular velocity $\omega$, which is defined in the body-fixed 
frame. When d'Alembert's principle of virtual work is applied to the entire system together with 
the constraint equations via Lagrange multipliers, the equations of motion for a multibody 
dynamics system with $n$ physical coordinates and $m$ constraints can be expressed in the 
following DAE form: 

$$M \dot u + B^T \lambda = F \qquad (2.1)$$

with holonomic constraints 

$$\Phi(u) = 0 \qquad (2.2)$$

The first and second time differentiations of (2.2) yield 

$$\dot\Phi(u) = B u = 0 \qquad (2.3)$$

and 

$$\ddot\Phi(u) = B \dot u + \dot B u = 0 \qquad (2.4)$$

where $M$ is the $n \times n$ constant mass matrix, $B = \Phi_u$ is the $m \times n$ constraint Jacobian 
matrix, $\lambda$ is the vector of $m$ corresponding constraint forces, $F$ is the vector of $n$ generalized forces that 
include external forces and inertia forces due to centrifugal acceleration, and $\dot u$ consists of 
the translational and rotational accelerations. 

Note that, for each body u consists of three translational velocity components ex- 
pressed in the inertial frame and three angular velocity components expressed in the 
body-fixed frame. In other words, they are physical coordinates which are a particular 
set of generalized coordinates. In addition, due to the present representation of trans- 
lational motion (referred to an inertial frame) and rotational motion (expressed in the 
convected frame), the task for identifying the dependent and independent coordinates for 
the system constraint equations becomes straightforward, thus leading to the development 
of the present natural partitioning scheme. 

3. A Natural Partitioning Scheme 

In this section, the Lagrange multipliers are eliminated from (2.1) and a set of sec- 
ond order differential equations are derived in terms of system independent coordinates. 
To determine the system independent coordinates, a natural partitioning scheme is pro- 
posed to efficiently construct the null space of the constraint Jacobian matrix. A parallel 
methodology is demonstrated if system topologies consist of a number of tree structures. 
For a system that contains closed-loops, a cut-joint technique is used so that the present 
scheme can be equally applied. 

3.1 Constraint Elimination Method For DAE 

In constraint elimination, the main task is to find a projection matrix $A$ such that, 
when its transpose premultiplies $B^T \lambda$, we have 

$$A^T B^T \lambda = 0 \qquad (3.1.1)$$

This projection matrix can be obtained by expressing the physical velocity $u$ in terms of 
the independent velocities $u^*$ as 

$$u = A u^* \qquad (3.1.2)$$

Time differentiation of (3.1.2) gives 

$$\dot u = A \dot u^* + \dot A u^* \qquad (3.1.3)$$

Substituting (3.1.2) into (2.3) yields 

$$B u = B A u^* = 0 \qquad (3.1.4)$$

Since $u^*$ is a set of independent velocities and in general $u^* \neq 0$, (3.1.4) implies 

$$B A = 0 ; \qquad A^T B^T = 0 \qquad (3.1.5)$$

where the columns of $A$ span the null space of the constraint Jacobian matrix $B$. Once $A$ is 
constructed, pre-multiplication of (2.1) by $A^T$ yields 

$$A^T M \dot u + A^T B^T \lambda = A^T F \qquad (3.1.6)$$

By (3.1.1), the second term on the left hand side of (3.1.6) is equal to zero, hence the 
above equation reduces to 

$$A^T M \dot u = A^T F \qquad (3.1.7)$$

Substituting (3.1.3) into (3.1.7) yields the desired equations of motion in terms of the 
independent accelerations $\dot u^*$ as 

$$A^T M A \dot u^* = A^T F - A^T M \dot A u^* \qquad (3.1.8)$$

Once the right hand side of (3.1.8) is obtained, the system equations can be written in the 
following form: 

$$M^* \dot u^* = b \qquad (3.1.9)$$

where 

$$M^* = A^T M A \qquad (3.1.10)$$

$$b = A^T F - A^T M \dot A u^* \qquad (3.1.11)$$
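
The elimination above can be illustrated on a deliberately small example. The sketch below is a toy illustration, not the parallel scheme of the paper: two bodies moving along a line are tied together by the single constraint $u_1 - u_2 = 0$, so the null space matrix is constant and the $\dot A u^*$ term vanishes. All numerical values and helper names are hypothetical.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def reduced_system(M, A, F):
    """Return M* = A^T M A and b = A^T F for a constant null space matrix A."""
    At = transpose(A)
    M_star = matmul(matmul(At, M), A)
    b = matmul(At, F)
    return M_star, b

# Two bodies on a line with masses 2 and 3, constrained to move together.
M = [[2.0, 0.0], [0.0, 3.0]]
B = [[1.0, -1.0]]          # constraint Jacobian: u1 - u2 = 0
A = [[1.0], [1.0]]         # null space basis, so that B A = 0
F = [[10.0], [5.0]]        # applied forces

BA = matmul(B, A)
M_star, b = reduced_system(M, A, F)
udot_star = b[0][0] / M_star[0][0]   # single independent acceleration
```

Here the reduced mass is the total mass and the reduced force is the total applied force, so the independent acceleration is 3. The constraint force never appears, which is the point of projecting with $A^T$.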

3.2 A Natural Partitioning Scheme For Open-Loop MBD Systems 

To demonstrate the present natural partitioning scheme for open loop systems, a 
three-dimensional triple-pendulum problem (Fig. 1) is chosen. The constraint equations 
for this problem can be written as 


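$$
\begin{bmatrix} [B_{11}] & 0 & 0 \\ [B_{21}] & [B_{22}] & 0 \\ 0 & [B_{32}] & [B_{33}] \end{bmatrix}
\begin{Bmatrix} u_1 \\ u_2 \\ u_3 \end{Bmatrix}
=
\begin{bmatrix}
[-I] & [R_{s11}] & 0 & 0 & 0 & 0 \\
[I] & [R_{s21}] & [-I] & [R_{s22}] & 0 & 0 \\
0 & 0 & [I] & [R_{s32}] & [-I] & [R_{s33}]
\end{bmatrix}
\begin{Bmatrix} u_1 \\ u_2 \\ u_3 \end{Bmatrix} = 0 \tag{3.2.1}
$$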


where the bodies in this pendulum problem are connected by three spherical joints and the $R_s$
are functions of rotational operators and position vectors from the center of mass of each
body to the position of its connecting joints. To obtain the necessary projection matrix
$A$, we start with the first row of (3.2.1):

$$B_{11} u_1 = [\, -I , \; R_{s11} \,] \, u_1 = 0 \tag{3.2.2}$$

that can be partitioned into

$$[\, B_{11}^d \;\; B_{11}^i \,] \begin{Bmatrix} u_1^d \\ u_1^i \end{Bmatrix} = 0 \tag{3.2.3}$$

or

$$B_{11}^d u_1^d + B_{11}^i u_1^i = 0 \tag{3.2.4}$$

where $B_{11}^d = -I$, $B_{11}^i = R_{s11}$, $d$ denotes the dependent coordinates and $i$ denotes
the independent coordinates. Since $|B_{11}^d| \neq 0$, the dependent velocity components of the first
body can be calculated as

$$u_1^d = -(B_{11}^d)^{-1} B_{11}^i u_1^i = P_1 u_1^i \tag{3.2.5}$$

where $P_1 = -(B_{11}^d)^{-1} B_{11}^i = R_{s11}$. The velocity vector of the first body, $u_1$, can be written in
terms of the independent velocities $u_1^i$ as

$$u_1 = \begin{Bmatrix} u_1^d \\ u_1^i \end{Bmatrix} = \begin{bmatrix} P_1 \\ I \end{bmatrix} u_1^i = Q_1 u_1^i \tag{3.2.6}$$

where $Q_1 = \begin{bmatrix} P_1 \\ I \end{bmatrix}$.

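Likewise, $B_{22}$ of the second row of (3.2.1) can be partitioned into

$$B_{21} u_1 + [\, B_{22}^d \;\; B_{22}^i \,] \begin{Bmatrix} u_2^d \\ u_2^i \end{Bmatrix} = 0 \tag{3.2.7}$$

or

$$u_2^d = -(B_{22}^d)^{-1} ( B_{21} u_1 + B_{22}^i u_2^i ) \tag{3.2.8}$$

for $|B_{22}^d| \neq 0$. Substituting (3.2.6) into (3.2.8) yields

$$u_2^d = -(B_{22}^d)^{-1} ( B_{21} Q_1 u_1^i + B_{22}^i u_2^i ) = R_1 u_1^i + R_2 u_2^i \tag{3.2.9}$$

where $R_1 = -(B_{22}^d)^{-1} B_{21} Q_1 = B_{21} Q_1$ and $R_2 = -(B_{22}^d)^{-1} B_{22}^i = B_{22}^i$. The velocity vector
of the second body, $u_2$, can be expressed in terms of the independent velocities $u_1^i$ and $u_2^i$ as

$$u_2 = \begin{Bmatrix} u_2^d \\ u_2^i \end{Bmatrix} = \begin{bmatrix} R_1 & R_2 \\ 0 & I \end{bmatrix} \begin{Bmatrix} u_1^i \\ u_2^i \end{Bmatrix} = S_1 u_1^i + S_2 u_2^i \tag{3.2.10}$$

where $S_1 = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$ and $S_2 = \begin{bmatrix} R_2 \\ I \end{bmatrix}$. Applying the same procedure to the third row of
(3.2.1), $u_3^d$ can be expressed as

$$u_3^d = -(B_{33}^d)^{-1} ( B_{32} u_2 + B_{33}^i u_3^i ) = -(B_{33}^d)^{-1} [ B_{32} ( S_1 u_1^i + S_2 u_2^i ) + B_{33}^i u_3^i ]
= V_1 u_1^i + V_2 u_2^i + V_3 u_3^i \tag{3.2.11}$$

where $V_1 = -(B_{33}^d)^{-1} B_{32} S_1 = B_{32} S_1$, $V_2 = -(B_{33}^d)^{-1} B_{32} S_2 = B_{32} S_2$, $V_3 = -(B_{33}^d)^{-1} B_{33}^i = B_{33}^i$,
and $u_3$ can be written in terms of $u_1^i$, $u_2^i$, and $u_3^i$ as

$$u_3 = \begin{Bmatrix} u_3^d \\ u_3^i \end{Bmatrix} = \begin{bmatrix} V_1 & V_2 & V_3 \\ 0 & 0 & I \end{bmatrix} \begin{Bmatrix} u_1^i \\ u_2^i \\ u_3^i \end{Bmatrix} = W_1 u_1^i + W_2 u_2^i + W_3 u_3^i \tag{3.2.12}$$

where $W_1 = \begin{bmatrix} V_1 \\ 0 \end{bmatrix}$, $W_2 = \begin{bmatrix} V_2 \\ 0 \end{bmatrix}$, and $W_3 = \begin{bmatrix} V_3 \\ I \end{bmatrix}$. Combining (3.2.6), (3.2.10), and
(3.2.12), we construct the physical velocities $u$ in terms of $u^*$ as

$$\begin{Bmatrix} u_1 \\ u_2 \\ u_3 \end{Bmatrix} = \begin{bmatrix} Q_1 & 0 & 0 \\ S_1 & S_2 & 0 \\ W_1 & W_2 & W_3 \end{bmatrix} \begin{Bmatrix} u_1^i \\ u_2^i \\ u_3^i \end{Bmatrix} \tag{3.2.13}$$

or

$$u = A u^* \tag{3.2.14}$$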

where $A$ is the null-space matrix of the constraint Jacobian that was exploited in the
previous section. Note that, in forming $A$, the inverses of the dependent matrices are
obtained analytically, whereas the generalized coordinate partitioning scheme requires
these inverses to be computed numerically. The scheme for constructing $A$ provides a
guideline for dealing with MBD systems of different topologies, such as multiple open
kinematic chains and closed kinematic loops, which are discussed in the following sections.

3.3 Natural Partitioning Scheme For Multiple Open Chain Systems

If the MBD systems have more than one branch as shown in Fig. 2, the present 
scheme lends itself to multiprocessor computers. This property can be demonstrated by 
the following MBD system where the constraint equations are given by 


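$$
\begin{bmatrix}
[B_{11}] & 0 & 0 & 0 & 0 \\
[B_{21}] & [B_{22}] & 0 & 0 & 0 \\
0 & [B_{32}] & [B_{33}] & 0 & 0 \\
[B_{41}] & 0 & 0 & [B_{44}] & 0 \\
0 & 0 & 0 & [B_{54}] & [B_{55}]
\end{bmatrix}
\begin{Bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{Bmatrix} = 0 \tag{3.3.1}
$$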


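Applying the proposed scheme, the $A$ matrix is selected as

$$
\begin{Bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{Bmatrix}
=
\begin{bmatrix}
Q_1 & 0 & 0 & 0 & 0 \\
S_1 & S_2 & 0 & 0 & 0 \\
W_1 & W_2 & W_3 & 0 & 0 \\
Y_1 & 0 & 0 & Y_4 & 0 \\
Z_1 & 0 & 0 & Z_4 & Z_5
\end{bmatrix}
\begin{Bmatrix} u_1^i \\ u_2^i \\ u_3^i \\ u_4^i \\ u_5^i \end{Bmatrix} \tag{3.3.2}
$$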


Note that, in the natural partitioning scheme, once the first row of (3.3.2) is constructed,
the second and fourth rows of (3.3.2) can be constructed simultaneously from the given
$Q_1$. Likewise, once the first, second, and fourth rows of (3.3.2) are found, the third and
fifth rows can be obtained along their respective branches. Since MBD systems typically
include many kinematic loops, it is natural to use this development on a multiprocessor
computer to compute the null space (at each branch) of the constraint Jacobian matrix.
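
The row-by-row construction of (3.3.2) can be sketched as a recursion over the tree of bodies. The code below is an illustrative sketch only: the function names, the `parents` list encoding of the topology, and the scalar toy blocks are hypothetical, and it assumes $B_{jj}^d = -I$ for every joint, as in the spherical-joint example of Section 3.2.

```python
def natural_partition_null_space(parents, B_par, B_ind, nd, ni):
    """Build A with u = A u* for a tree of bodies, assuming B^d_jj = -I.

    parents[j] : index of the parent body of body j (None for a root body)
    B_par[j]   : nd x (nd+ni) block B_{j,parent} (ignored for a root body)
    B_ind[j]   : nd x ni block B^i_jj
    Bodies must be numbered so that parents[j] < j.
    """
    n, nb = len(parents), nd + ni
    A = [[0.0] * (n * ni) for _ in range(n * nb)]
    for j in range(n):
        rows_d = range(j * nb, j * nb + nd)        # dependent rows of body j
        rows_i = range(j * nb + nd, (j + 1) * nb)  # independent rows of body j
        p = parents[j]
        if p is not None:
            # u_j^d picks up B_{j,p} u_p, with u_p already expressed through A
            for r, row in zip(rows_d, B_par[j]):
                for c in range(n * ni):
                    A[r][c] = sum(row[k] * A[p * nb + k][c] for k in range(nb))
        # own contribution: u_j^d += B^i_jj u_j^i, and u_j^i maps to itself
        for r, row in zip(rows_d, B_ind[j]):
            for k in range(ni):
                A[r][j * ni + k] += row[k]
        for k, r in enumerate(rows_i):
            A[r][j * ni + k] = 1.0
    return A


def assemble_B(parents, B_par, B_ind, nd, ni):
    """Assemble the constraint Jacobian with the same block layout (check only)."""
    n, nb = len(parents), nd + ni
    B = [[0.0] * (n * nb) for _ in range(n * nd)]
    for j, p in enumerate(parents):
        for r in range(nd):
            B[j * nd + r][j * nb + r] = -1.0                      # B^d_jj = -I
            for k in range(ni):
                B[j * nd + r][j * nb + nd + k] = B_ind[j][r][k]   # B^i_jj
            if p is not None:
                for k in range(nb):
                    B[j * nd + r][p * nb + k] = B_par[j][r][k]    # B_{j,parent}
    return B


def branch_levels(parents):
    """Depth of each body; bodies of equal depth depend only on earlier levels."""
    levels = []
    for p in parents:
        levels.append(0 if p is None else levels[p] + 1)
    return levels


# Topology of (3.3.1): bodies 2 and 4 hang off body 1, body 3 off 2, body 5 off 4.
parents = [None, 0, 1, 0, 3]
B_ind = [[[0.5]], [[2.0]], [[-1.0]], [[3.0]], [[0.25]]]          # toy 1x1 blocks
B_par = [None, [[1.0, 0.5]], [[1.0, 2.0]], [[1.0, 1.0]], [[1.0, 1.5]]]

A = natural_partition_null_space(parents, B_par, B_ind, nd=1, ni=1)
B = assemble_B(parents, B_par, B_ind, nd=1, ni=1)
BA = [[sum(b * a for b, a in zip(row, col)) for col in zip(*A)] for row in B]
levels = branch_levels(parents)
```

In this toy case the rows of `A` for body 4 carry zeros in the columns of bodies 2 and 3, which mirrors the sparsity pattern of (3.3.2). Bodies that share a level in `levels` only read rows of their own parent, so their rows could be built on separate processors.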


3.4 Natural Partitioning Scheme For Closed-Loop MBD Systems 

When the systems have one or more closed loops, difficulty arises in constructing the 
null space of the constraint Jacobian matrix as one will see from examining the following 
three body crank-slider problem (Fig. 3). The constraint equations for this problem are 
given by 


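$$
\begin{bmatrix}
[B_{11}] & 0 & 0 \\
[B_{21}] & [B_{22}] & 0 \\
0 & [B_{32}] & [B_{33}] \\
0 & 0 & [B_{43}]
\end{bmatrix}
\begin{Bmatrix} u_1 \\ u_2 \\ u_3 \end{Bmatrix} = 0 \tag{3.4.1}
$$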


It is obvious that joints 1 and 4 conflict in determining the null space of (3.4.1) according
to the preceding scheme. Fortunately, there is a technique to overcome this difficulty. The
technique, called “cut joints”, cuts the joints that must be removed to turn the
system topology into open loops, so that the existing solution procedure can be
adopted. This is accomplished by partitioning (3.4.1) into the following form

$$
\left[ \begin{array}{ccc}
[B_{11}] & 0 & 0 \\
[B_{21}] & [B_{22}] & 0 \\
0 & [B_{32}] & [B_{33}] \\ \hline
0 & 0 & [B_{43}]
\end{array} \right]
\begin{Bmatrix} u_1 \\ u_2 \\ u_3 \end{Bmatrix} = 0 \tag{3.4.2}
$$

or

$$B_o u = 0, \qquad B_c u = 0 \tag{3.4.3}$$


where $B_o$ represents the open-loop constraint Jacobian matrix and $B_c$ represents the
remaining constraint Jacobian matrix after the joints have been cut. The natural
partitioning scheme is then used to construct the null space of $B_o$ as

$$B_o A_o = 0 ; \qquad A_o^T B_o^T = 0 \tag{3.4.4}$$

Performing algebraic calculations as in Section 3.1 yields the equations of motion for a
closed-loop MBD system as

$$M \dot{u} + B_o^T \lambda_o + B_c^T \lambda_c = F \tag{3.4.5}$$

Pre-multiplying the above equation by $A_o^T$ yields

$$A_o^T M \dot{u} + A_o^T B_c^T \lambda_c = A_o^T F \tag{3.4.6}$$

which can be solved either by employing the penalty constraint stabilization technique 
(P.C.S.T.) or by constructing the null space for the new equations of motion. 

4. A Solution Procedure for MBD Systems 

A common procedure for solving DAE is to augment (2.1) and (2.4) into the following 
system of differential equations 


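$$\begin{bmatrix} M & B^T \\ B & 0 \end{bmatrix} \begin{Bmatrix} \dot{u} \\ \lambda \end{Bmatrix} = \begin{Bmatrix} F \\ -\dot{B} u \end{Bmatrix} \tag{4.1}$$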


so that numerical ordinary differential equation solvers can be applied. The drawbacks
of this approach are: first, (2.4) does not represent the original constraint equations
(2.2); second, the constraints are violated during numerical integration. A constraint
stabilization technique proposed by Baumgarte can be used to stabilize (4.1). The
disadvantages of this technique have been studied, and a new stabilization
technique has been developed in [12,13] so that constraint violation can be stabilized
efficiently. An alternative approach to avoid constraint violation is to obtain the null
space of the constraint Jacobian matrix, as suggested in the present scheme. De Jalon et
al. have developed a formulation using the so-called natural coordinates, so that
equations of motion similar to (3.1.8) are obtained. The drawbacks of their approach have been
discussed in [10,11]. A solution that avoids these drawbacks can be achieved by augmenting
(3.1.7) and (2.3) into

$$\begin{bmatrix} A^T M \\ B \end{bmatrix} \dot{u} = \begin{Bmatrix} A^T F \\ -\dot{B} u \end{Bmatrix} \tag{4.2}$$

which, however, not only destroys the symmetry of the matrix in (4.2) but also violates the
constraint conditions when time integration algorithms are used. The following section
discusses an approach that overcomes these difficulties with parallel computation in mind.


4.1 Application of Parallel Computations 

Since MBD systems may involve hundreds of bodies, the solution of such systems requires
a large amount of computation. For real-time simulation, existing parallel
computers need to be utilized and new numerical algorithms need to be developed in
order to speed up the solution process. So, instead of solving the second-order differential
equations (3.1.8), we augment (3.1.3) and (3.1.7) into the following form:


$$\begin{bmatrix} -M & MA \\ A^T M & 0 \end{bmatrix} \begin{Bmatrix} \dot{u} \\ \dot{u}^* \end{Bmatrix} = \begin{Bmatrix} -M \dot{A} u^* \\ A^T F \end{Bmatrix} \tag{4.3}$$

Following [24,25], we can partition $M$, $u$, and $MA$ into the following form

$$
\begin{bmatrix}
-M_1 & & & D_{(1,n+1)} \\
& \ddots & & \vdots \\
& & -M_n & D_{(n,n+1)} \\
D_{(n+1,1)} & \cdots & D_{(n+1,n)} & 0
\end{bmatrix}
\begin{Bmatrix} \dot{u}_1 \\ \vdots \\ \dot{u}_n \\ \dot{u}^* \end{Bmatrix}
=
\begin{Bmatrix} c_1 \\ \vdots \\ c_n \\ d \end{Bmatrix} \tag{4.4}
$$

where $n$ is the total number of bodies in the system. The above system with an arrowhead
matrix (4.4) can be written as

$$-M_j \dot{u}_j + D_{(j,n+1)} \dot{u}^* = c_j , \quad j = 1, \ldots, n$$

$$\sum_{j=1}^{n} D_{(n+1,j)} \dot{u}_j = d \tag{4.5}$$

where

$$\sum_{j=1}^{n} D_{(n+1,j)} = \sum_{j=1}^{n} A_j^T M_j$$

$$D_{(j,n+1)} = M_j A_j , \quad j = 1, \ldots, n$$

$$c_j = -(M \dot{A} u^*)_j , \quad j = 1, \ldots, n$$

$$d = \sum_{j=1}^{n} A_j^T F_j$$

Each diagonal submatrix $M_j$ represents the local mass matrix, which is decoupled and can
be factorized concurrently. An off-diagonal submatrix $D_j$ denotes the coupling between
two connecting bodies in the system. Since $M$ is a constant matrix, (4.5a) becomes

$$\dot{u}_j = M_j^{-1} ( D_{(j,n+1)} \dot{u}^* - c_j ) \tag{4.6}$$

Substituting (4.6) into (4.5b) gives the well-known Schur complement

$$\Big( \sum_{j=1}^{n} D_{(n+1,j)} M_j^{-1} D_{(j,n+1)} \Big) \dot{u}^* = \sum_{j=1}^{n} D_{(n+1,j)} M_j^{-1} F_j - \sum_{j=1}^{n} D_{(n+1,j)} \dot{A}_j u^* \tag{4.7}$$


where (3.1.8) is recovered. Several aspects of the present procedure have been observed : 

(1) The parallelism in the multibody system is exploited by mapping each processor onto 
a group of bodies so that independent computations such as the left hand side of (4.7) 
can be performed concurrently. 

(2) Since $M_j$ is a constant mass matrix, it needs to be factored only once.

(3) To solve for $\dot{u}^*$, a parallel sparse solver such as the one described in [25] may be utilized.

(4) Once $\dot{u}^*$ is obtained, the evaluation of $\dot{u}$ from (4.6) is trivially parallelized.
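
The body-by-body character of (4.5) to (4.7) can be illustrated with a short sketch. This is not the PMBS code: the function names and the scalar test data are hypothetical, and each body is assumed to supply its own blocks $M_j$, $A_j$, $F_j$ and $g_j = (\dot{A} u^*)_j$.

```python
def assemble_reduced_system(bodies, m):
    """Accumulate M* = sum_j A_j^T M_j A_j and b = sum_j A_j^T (F_j - M_j g_j).

    bodies : list of (M_j, A_j, F_j, g_j) with g_j = (Adot u*)_j
    m      : number of independent velocities u*
    Each body's contribution is independent of the others, so the loop body
    is the unit of work that could be handed to a separate processor.
    """
    M_star = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for M_j, A_j, F_j, g_j in bodies:
        nb = len(M_j)
        # D_(j,n+1) = M_j A_j
        MA = [[sum(M_j[r][k] * A_j[k][c] for k in range(nb)) for c in range(m)]
              for r in range(nb)]
        Mg = [sum(M_j[r][k] * g_j[k] for k in range(nb)) for r in range(nb)]
        for a in range(m):
            for c in range(m):
                M_star[a][c] += sum(A_j[r][a] * MA[r][c] for r in range(nb))
            b[a] += sum(A_j[r][a] * (F_j[r] - Mg[r]) for r in range(nb))
    return M_star, b


def recover_accelerations(bodies, du_star):
    """Evaluate du_j = A_j du* + (Adot u*)_j for every body, as in (4.6)."""
    return [[sum(A_j[r][c] * du_star[c] for c in range(len(du_star))) + g_j[r]
             for r in range(len(A_j))]
            for _, A_j, _, g_j in bodies]


# Hypothetical data: three scalar bodies sharing one independent velocity.
bodies = [
    ([[2.0]], [[1.0]], [1.0], [0.1]),
    ([[3.0]], [[1.0]], [2.0], [0.2]),
    ([[5.0]], [[1.0]], [7.0], [0.0]),
]
M_star, b = assemble_reduced_system(bodies, m=1)
du_star = [b[0] / M_star[0][0]]        # 1x1 system, solved directly
du = recover_accelerations(bodies, du_star)
```

The sums over bodies are the only points where data from different processors meet, which is why a reduction over $j$ followed by a small solve for $\dot{u}^*$ is a natural mapping onto a shared-memory machine.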

4.2 Parallel Solution Procedure for MBD Systems 

The solution procedure using the natural partitioning scheme can be summarized with 
the following steps : 

[1] Construct $A$ at step $n$.

[2] Solve (4.3) at step $n$ for $\dot{u}$ and $\dot{u}^*$.

[3] Integrate translational and angular velocities from $n$ to $n + 1$ by using $\dot{u}$ and $\dot{u}^*$.

[4] Integrate translational displacements and angular orientations from $n$ to $n + 1$ by
using $u$ and $u^*$.

Current MBD programs, developed over the last twenty years,
were tailored for sequential computers with core memory limitations. Limited core memory
motivated researchers to develop sparse matrix methods that dramatically
decrease computer storage. In selecting a solution scheme for a multiprocessing system,
iterative solution methods are often preferred over direct methods because they require
less synchronization and/or interprocessor communication. Most studies of MBD
algorithms assume that the system equations have already been formed. As indicated in
(4.5), the system equations can be generated independently and in parallel. It would be
natural if the solution scheme could also be processed at the body-by-body level without
forming the system equations. Among the iterative solution methods, the conjugate gradient
method appears to be the most promising candidate because of its inherent parallelism [24-26].
The following parallel preconditioned conjugate gradient (PPCG) scheme, specialized to MBD
systems, is summarized in two steps with (3.1.9) as the system equations:

(1) Solve in parallel using all the processors $M^* \dot{u}^* = b$

• Form the right hand side of the Schur complement:

    For j = 1 to N_p do concurrently
        Form $T_r(j) =$
        Form $b(j) = d(j) - D(j) \, T_r(j)$

• Initialize:

    $x_0 = 0$
    $r_0 = b$

For k = 1, ..., n

    If $r_{k-1} = 0$ then quit
    Else

    • Compute new conjugate search direction:

        Solve $P z_{k-1} = r_{k-1}$ for $z_{k-1}$
        $\beta_k = z_{k-1}^T r_{k-1} / z_{k-2}^T r_{k-2}$   ($\beta_1 = 0$)
        $p_k = z_{k-1} + \beta_k p_{k-1}$   ($p_1 = z_0$)

    • Form the left hand side of the Schur complement:

        For j = 1 to N_p do concurrently
            Form $T_l(j) = D^T(j) \, p_k(j)$
            Form $T_l(j) =$
            Form $M^*(j) \, p_k(j) = -D(j) \, T_l(j)$

    • Line search to update solution and residual:

        $\alpha_k = z_{k-1}^T r_{k-1} / p_k^T M^* p_k$
        $x_k = x_{k-1} + \alpha_k p_k$
        $r_k = r_{k-1} - \alpha_k M^* p_k$

    Endif

(2) Broadcast the part of x corresponding to the handled rows of D to neighboring pro- 
cessors and solve for u as in the following steps : 

For j = 1 to N p do concurrently 

• Receive x 

• Back substitute for u 

• Send u to host for output 
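
The iteration above can be sketched as follows. This is a serial illustration under stated assumptions, not the PPCG code of the report: the product $M^* p$ is accumulated body by body as $\sum_j A_j^T M_j A_j p$ without assembling $M^*$, and a Jacobi (diagonal) preconditioner is used purely as a placeholder, since the text leaves the choice of preconditioner open. All function names and the two-body data are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec_bodywise(bodies, p):
    """Return M* p = sum_j A_j^T (M_j (A_j p)) without assembling M*."""
    out = [0.0] * len(p)
    for M_j, A_j in bodies:                 # each term is independent of the others
        t = [dot(row, p) for row in A_j]    # A_j p
        t = [dot(row, t) for row in M_j]    # M_j (A_j p)
        for c in range(len(p)):             # A_j^T (M_j A_j p)
            out[c] += sum(A_j[r][c] * t[r] for r in range(len(t)))
    return out


def jacobi_preconditioner(bodies, m):
    """Diagonal of M*, obtained by probing with unit vectors (placeholder choice)."""
    diag = []
    for c in range(m):
        e = [1.0 if k == c else 0.0 for k in range(m)]
        diag.append(matvec_bodywise(bodies, e)[c])
    return lambda r: [ri / di for ri, di in zip(r, diag)]


def pcg(matvec, b, precond, tol=1e-12, max_iter=50):
    """Preconditioned conjugate gradient for a symmetric positive definite system."""
    x = [0.0] * len(b)
    r = list(b)
    p, rz_old = None, None
    for _ in range(max_iter):
        if max(abs(v) for v in r) <= tol:
            break
        z = precond(r)                       # solve P z = r
        rz = dot(z, r)
        if p is None:
            p = z                            # first direction is z_0
        else:
            beta = rz / rz_old
            p = [zi + beta * pi for zi, pi in zip(z, p)]
        Mp = matvec(p)
        alpha = rz / dot(p, Mp)              # line search
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, Mp)]
        rz_old = rz
    return x


# Hypothetical two-body system with two independent velocities.
bodies = [
    ([[2.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    ([[1.0, 0.0], [0.0, 3.0]], [[1.0, 1.0], [0.0, 1.0]]),
]
b = [5.0, 11.0]                              # chosen so that the solution is (1, 2)
x = pcg(lambda p: matvec_bodywise(bodies, p), b, jacobi_preconditioner(bodies, 2))
```

For this data the implied matrix is $M^* = \begin{bmatrix} 3 & 1 \\ 1 & 5 \end{bmatrix}$, and the iteration reaches the solution in two steps, as expected for a two-unknown symmetric positive definite system.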

As noted in (4.7), the conjugate gradient method is used to obtain the system independent
variables without forming the null space matrix of the constraint Jacobian matrix. The
reason is that the major operation of the conjugate gradient method is the multiplication of
a matrix by a trial vector. Thus, we can rewrite (4.7) as

$$
\begin{aligned}
v &= B M^{-1} B^T p \\
  &= \sum_{j=1}^{n} B_{(n+1,j)} M_j^{-1} B_{(j,n+1)} \, p \\
  &= \sum_{j=1}^{n} B_{(n+1,j)} M_j^{-1} p^e \\
  &= \sum_{j=1}^{n} B_{(n+1,j)} v^e
\end{aligned} \tag{4.8}
$$

where v* = , ..., u^]. This multiplication is performed in three steps, and they 

add different contributions from the respective bodies to the entries of the resulting vector.
The matrix-vector multiplications are performed directly at the body level and result
in the global vector $v$.

Preconditioning can be used to accelerate the convergence of the conjugate gradient
method. This is achieved by solving the modified system

$$P M^* x = P b \tag{4.9}$$

where $P$ is the preconditioning matrix. Selection of an optimal preconditioner for the present
MBD problems will be addressed in future work.

A prototype code for the dynamic analysis of MBD systems on a shared-memory
multiprocessor is currently under development at the Center for Space Structures and Controls
(CSSC). The software architecture and the numerical algorithm presented in this paper
are part of the code. A test version called PMBS (Parallel Multi-Body System) has been
implemented on the Alliant FX/8 by using Force macros [29]. Several example problems
have been run, and the results are shown in the following section.

5. Numerical Examples 

Two MBD systems are simulated in this section by
using the scheme and the algorithm developed in the previous sections. The resulting robust
algorithm solves the present equations of motion for arbitrary system topologies.

5.1 Three Dimensional Three-Link Manipulator 

In order to validate the feasibility, effectiveness, and accuracy of the present scheme, a
three-link manipulator, which has been studied by Gawronski and Ih [27,28], was simulated
under the given specifications. The manipulator is subject to specified nonholonomic tip
velocity constraints throughout the simulation, as shown in Fig. 4. The joints that
connect the links are modeled as spherical and revolute joints. Initially, Lagrange
multipliers are introduced to enforce the joint constraints as well as the nonholonomic
constraints at the tip of the manipulator. The Lagrange multipliers are then eliminated by
adopting the present scheme so that the numerical algorithms can be applied. During
time stepping, the manipulator maneuvers along the desired trajectories, which are
given in two different vertical planes as illustrated in Figs. 5 and 6. The corresponding joint
velocities and accelerations, which match the results given by Gawronski and Ih quite
closely, are shown in Figs. 7 and 8. Numerical experiments, although not
reported herein, show that the present scheme and algorithm require considerably less CPU
time than the penalty constraint stabilization technique because of their lower
operation counts. This will play an important role in real-time simulation.

5.2 Double-Wishbone Auto-Suspension Systems

To explore the parallelism of the present scheme, we select a vehicle model with
multiple suspension systems, shown in Fig. 9, for which the input data were provided
by Nikravesh of the University of Arizona. According to the scheme
of Section 3, the vehicle can be easily partitioned into four subsystems, each assigned
to an independent processor, so that the null space of
the constraint Jacobian matrix can be constructed in parallel. Note that the suspension
systems possess four sets of springs and dampers with given locations and spring and damping
coefficients. The tires of the vehicle are modeled by unilateral spring elements.
Initially, the vehicle is positioned at a height of one meter above the ground with zero
initial velocities. When the vehicle is released, gravity acts as the external
load that forces the vehicle to fall. Fig. 10 illustrates the response of one of the springs to
the given external load during a one-second simulation. The displacements of each
body are given in Figs. 11-15.
The interesting features of this simulation are the CPU time consumption (Fig. 16) and
the speed-up (Fig. 17) obtained with different numbers of processors on the Alliant FX/8.
Note that the results of the present scheme (N.P.S.) are compared with those produced by
the previously developed penalty constraint stabilization technique.

6. Conclusion 

An efficient numerical method for the dynamic analysis of MBD systems has been pre- 
sented. A scheme that requires less CPU time to generate the null space for the constraint 
Jacobian matrix has been developed. The present scheme, which is robust for kinematic 
chains with variable degrees of freedom, provides the system independent coordinates that 
can be integrated without violating the kinematic constraint conditions. A parallel pre- 
conditioned conjugate gradient is also developed to solve the system governing equations 
of motion which are written in the Schur complement form so that parallel computations 
can be applied. Finally, the two example problems, dealing with holonomic
and nonholonomic constraints, show the generality of the scheme and its suitability for a
general purpose computer program for the dynamic analysis of MBD systems.

Fig. 3 The Crank-Slider Problem

Fig. 16 Comparison of Total CPU Time Used by Both Techniques on Alliant FX/8

Fig. 17 Comparison of Speed Up by Both Techniques on Alliant FX/8





Stabilization of Computational Procedures for 
Constrained Dynamical Systems 

K. C. Park and J. C. Chiou 



Reprinted from 

Journal of Guidance, Control, and Dynamics 


Volume 11, Number 4, July-August 1988, Pages 365-370

American Institute of Aeronautics and Astronautics
370 L'Enfant Promenade, SW, Washington, DC 20024



University of Colorado, Boulder, Colorado

A new stabilization method of treating constraints in multibody dynamical systems is presented. By tailoring a 
penalty form of the constraint equations, the method achieves stabilization without artificial damping and yields a 
companion matrix differential equation for the constraint forces; hence, the constraint forces are obtained by 
integrating the companion differential equation for the constraint forces in time. A principal feature of the method 
is that the errors committed in each constraint condition decay with its corresponding characteristic time scale 
associated with its constraint force. Numerical experiments indicate that the method yields a marked improvement 
over existing techniques. 



I. Introduction 

THE dynamics of flexible multibody systems, such as the
design of robotic manipulators, mechanical chains, and 
satellites, is becoming increasingly important in engineering. 
Computer simulation of such multibody dynamical (MBD) 
systems requires a concerted integration of several computa- 
tional aspects. These include selection of a data structure for 
describing the system topology, computerized generation of 
the governing equations of motion, incorporation of con- 
straint conditions, implementation of suitable solution al- 
gorithms, and easy interpretation of the simulation results. 

Traditionally, the task of formulating the equations of
motion has been of dominant concern to many dynamists. As
a result, several MBD formulations have been proposed; these
differ primarily in the manner in which they incorporate
constraints and in their resulting system topologies.$^{1-13}$ Hence,
reliability and cost of existing MBD simulation packages have
been strongly affected by how well the equations of motion
have been streamlined and how well the constraints are
preserved during the numerical solution stage.

As dynamists face more complex problems, particularly in
the field of large space structures, a new consensus is emerging:
MBD simulation requires a data structure that can
accommodate various system topologies. A primary motivation
for espousing a maximum flexibility in the data structure
is to allow, for each subsystem of a complex MBD system, the
adoption of different modeling assumptions, different
formulations of the equations of motion, and different solution
techniques. Once this need is recognized, compatibility of
subsystems as well as of various constraints becomes a focal
computational issue. However, enforcing such subsystem and
kinematical compatibilities leads to a formulation that
involves a set of auxiliary constraints that must be satisfied at
each integration step.

Because it is important in the simulation of MBD systems
to treat the resulting constraints accurately and reliably, several
computational procedures have been proposed. These include
the technique for condensing dependent variables via
singular-value decomposition by Walton and Steeves,$^{14}$ equilibrium
correction strategies by Baumgarte,$^{15,16}$ penalty formulation
by Orlandea et al.$^{17}$ and Lotstedt,$^{18}$ the coordinate
partitioning technique by Wehage and Haug,$^{19}$ and the
differential/algebraic approach by Gear$^{20}$ and Petzold.$^{21}$ In
addition, recent reports by Huston and Kamman,$^{23}$ Fuehrer and
Wallrapp,$^{24}$ Schwertassek and Roberson,$^{25}$ and Nikravesh$^{26}$
address various related techniques.

Among the procedures cited, it is generally agreed that
Baumgarte’s technique is the most reliable one for handling
constraints. Thus, we believe that new methods for constraint
stabilization should be compared with Baumgarte’s technique.
However, an examination of Baumgarte’s technique has
revealed that it has three important algorithmic and software
difficulties.

First, according to Baumgarte’s formulation that leads to
his constraint stabilization, the error committed in all the
constraint conditions during time integration steps can decay
only with a uniform characteristic time constant. In other
words, each of the constraint equations converges at the same
rate regardless of its physical nature. This uniform convergence
rate masks an important physical phenomenon: the
characteristic time constants of each constraint equation are
different, since Lagrangian multipliers associated with the
constraint equations exhibit different physical response
characteristics. Hence, Baumgarte’s technique does not exploit the
well-known observation that the principal errors in
multi-degree-of-freedom systems behave the same way as do those
associated with the individual physical components.

Second, Baumgarte’s technique requires that the solution
matrix, $B M^{-1} B^T$, be invertible, where $B$ is the gradient of
the constraint equations and $M$ is the mass matrix. It is noted
that the solution matrix becomes singular (or ill-conditioned)
if two or more constraints become numerically dependent (or
almost dependent) upon one another. When that happens, the
potential gain in accuracy realized by Baumgarte’s stabilization
is lost.

Third, Baumgarte’s technique requires the solution of an 
augmented matrix equation that involves the constraint gradi- 
ent matrix B. This means that whenever additional con- 
straints are introduced or when some of the constraints are 
relaxed, the matrix profiles of the total-system equations will 
have to be varied. The task of dynamically varying matrix 
profiles of the total-system equations can significantly com- 
plicate software implementation. 

The objective of the present paper is to report a new 
stabilization technique that is aimed at mitigating the three 
algorithmic and software difficulties of Baumgarte’s technique. 
First, the new technique induces the errors in the constraint
equations to decay according to their principal characteristic
response time constants; the principal errors in the constraint
equations diminish according to their corresponding physical
response characteristics. Second, the new technique overcomes
the nonconvergence difficulty when two or more constraints
become numerically dependent. Third, the new technique yields
a matrix differential equation for the constraint forces. Hence,
the solution of the constraint forces can be carried out in a
separate module from that for the primary solution variables
(the position vector for the dynamical equations). To this end,
the paper is organized as follows.

Section II presents a review of the Lagrangian $\lambda$-method$^{26}$
for formulating the equations of motion with constraints,
including both configuration (holonomic) constraints and
motion (nonholonomic) constraints. An examination of
Baumgarte’s stabilization for constraints is offered in Sec. III,
delineating in detail the three noted algorithmic and software
implementation difficulties of the Baumgarte stabilization
technique.

Section IV presents a new stabilization technique based on
a control synthesis approach. First, we introduce the well-known
penalty technique so that the constraint forces are
made proportional to violations of the constraint conditions.
Second, by tailoring the governing equations of motion and by
augmenting the constraint equations with the tailored form of
the equations of motion, a stabilized differential form of
constraint equation is derived. The resulting stabilized
constraint equations are shown to be matrix differential equations
with the constraint forces as the primary solution vector, yet
possessing no artificial damping as is the case with Baumgarte’s
technique. Hence, one is left with a set of coupled differential
equations of motion in which the generalized displacements
and the constraint forces form a conjugate pair of unknowns.
It should be mentioned that a similar approach has been
successfully utilized for the solution of fluid-structure
interaction equations$^{27}$ and of fluid-porous soil interaction
equations$^{28}$ when the interaction equations are partitioned$^{29,30}$ and
solved in a staggered manner. For this reason, the present
method will be called a staggered stabilization technique.

Section V reports numerical experiments that illustrate the
improved performance of the present staggered stabilization
technique.$^{31}$ The paper ends with concluding remarks
regarding computer implementation issues in production-level MBD
simulation modules.


II. Equations of Motion with Constraints 
The Lagrangian equations of motion for mechanical systems
with constraints can be written as

$$\frac{d}{dt} \left( \frac{\partial L}{\partial \dot{q}_i} \right) - \frac{\partial L}{\partial q_i} + \sum_{k=1}^{m} \lambda_k B_{ki} = Q_i , \quad i = 1 \ldots n \tag{1}$$

$$\Phi_k = 0 , \quad k = 1 \ldots m \tag{2}$$

where $L$ is the system Lagrangian, $\Phi_k$ are the constraint
conditions imposed either on the subsystem boundaries or on
the kinematical relations among the generalized coordinates,
$q_i$ are the generalized coordinate components, $t$ is time, $(\dot{\ })$
denotes time differentiation, $\lambda$ is the Lagrangian multiplier,
$Q_i$ is the generalized applied force, and $B_{ki}$ is the $i$th gradient
component of the $k$th constraint equation, Eq. (2).

In order to focus our subsequent discussions, we specialize
Eq. (2) to the holonomic (configuration) case:

$$\Phi_k(q) = 0 , \quad B_{ki} = \frac{\partial \Phi_k}{\partial q_i} , \quad k = 1 \ldots m \tag{3}$$

and to the nonholonomic (motion) case:

$$\Phi_k(\dot{q}, q, t) = 0 , \quad B_{ki} = \frac{\partial \Phi_k}{\partial \dot{q}_i} , \quad k = 1 \ldots m \tag{4}$$

It should be noted that the constraint forces $Q^c$ are obtained
by

$$Q_i^c = \sum_{k=1}^{m} \lambda_k B_{ki} , \quad i = 1 \ldots n \tag{5}$$

and not by $\lambda_k$ alone.

Because the two constraints give rise to two different sets of
equations of motion, we will treat their time discretization
separately. It should be mentioned that a typical MBD system
involves both cases; hence, the solution procedure should
account for the two constraints concurrently.


Systems with Nonholonomic Constraints Only 

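When the system involves only nonholonomic constraints,
the equations of motion become

$$\begin{bmatrix} M & B^T \\ B & 0 \end{bmatrix} \begin{Bmatrix} \ddot{q} \\ \lambda \end{Bmatrix} = \begin{Bmatrix} Q \\ c \end{Bmatrix} \tag{6}$$

where $M$ is the mass matrix, $Q$ consists of the applied force,
the centrifugal and Coriolis force, and the internal spring
force, and $c$ is given by

$$c = -\frac{\partial \Phi}{\partial t} \tag{7}$$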


Systems with Holonomic Constraints Only 

When the system involves only holonomic constraints, the 
equations of motion become 


M 

°1( 

?w[° 

s r 

(*)- 

A 

. 0 

oj\; 

M [b 

0 . 

lx/ 

l c I 


III. Baumgarte’s Stabilization Technique
In Baumgarte’s technique, one replaces the second row of
Eq. (6) for the case of nonholonomic constraints by

$$\dot{\Phi} + \gamma \Phi = 0 \tag{9}$$

Hence, the right-hand side of the second row of Eq. (6) is
modified as

$$c = -\left( \frac{\partial \Phi}{\partial t} + \gamma \Phi \right) \tag{10}$$

Baumgarte sketched a solution scheme that uses the given
parabolic stabilization technique as follows. First, the
parabolically stabilized equation may be expanded as

$$B \ddot{q} + \frac{\partial \Phi}{\partial t} + \gamma \Phi = 0 \tag{11}$$

By substituting $\ddot{q}$ from the first row of Eq. (6), one obtains for
$\lambda$ the form

$$( B M^{-1} B^T ) \lambda = B M^{-1} Q + \frac{\partial \Phi}{\partial t} + \gamma \Phi \tag{12}$$

Hence, $\lambda$ in the preceding expression can be substituted into
the governing equations of motion to yield

$$M \ddot{q} = Q - B^T ( B M^{-1} B^T )^{-1} \left( B M^{-1} Q + \frac{\partial \Phi}{\partial t} + \gamma \Phi \right) \tag{13}$$

which can be integrated by an explicit integration formula.

For holonomic cases, he has recommended the following
integro-differential form:$^{16}$

$$\dot{\Phi} + 2 \gamma \Phi + \gamma^2 \int_0^t \Phi \, dt = 0 \tag{14}$$

so that one obtains 



In the paper where Baumgarte presented this procedure, no 
solution scheme was suggested, except that he advocated the 
adoption of generalized momenta as the primary variables. In 
the present context of the generalized coordinates q , a plausi- 
ble implementation of the stabilized integro-differential con- 
straint equations may be realized as follows. First, one in- 
tegrates the governing equation of motion, Eq. (8a), by an 
implicit integration formula 

$$q^{n+1} = \delta \, \dot{q}^{n+1} + h_q^n \tag{16}$$

where $\delta$ is a formula-dependent stepsize and $h_q^n$ is a historical
vector. For example, for the trapezoidal rule, we have

$$\delta = ( t_{n+1} - t_n ) / 2 \tag{17}$$
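
As a check on Eqs. (16) and (17), the trapezoidal rule can be rearranged into this form. The expression for the historical vector below is the standard one for this rule and is an assumption here, since the text does not spell it out; $\Delta t = t_{n+1} - t_n$ is introduced only for brevity.

```latex
q^{n+1} = q^{n} + \frac{\Delta t}{2}\left(\dot{q}^{n} + \dot{q}^{n+1}\right)
        = \underbrace{\frac{\Delta t}{2}}_{\delta}\,\dot{q}^{n+1}
        + \underbrace{\left(q^{n} + \frac{\Delta t}{2}\,\dot{q}^{n}\right)}_{h_q^{n}}
```

Under this reading the historical vector collects everything known from step $n$, so the only unknown on the right-hand side is $\dot{q}^{n+1}$.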

Integrating the equations of motion with holonomic
constraints once, by the preceding implicit formula, one obtains

$$\dot{q}^{n+1} = \delta M^{-1} ( Q - B^T \lambda^{n+1} ) + h_{\dot{q}}^n \tag{18}$$

We now substitute the preceding equation into the stabilized
integro-differential constraint equation, Eq. (14), to yield

$$\delta B M^{-1} B^T \lambda^{n+1} = \delta B M^{-1} Q + B h_{\dot{q}}^n + 2 \gamma \Phi + \gamma^2 \int_0^t \Phi \, dt \tag{19}$$

After substituting the given expression for $\lambda$, one can
integrate the resulting equation to obtain $q^{n+1}$ by either an
implicit or explicit integration formula. We now offer the
following remarks.

Remark I 

Each of the constraints for both the holonomic and
nonholonomic cases, $\{ \Phi_k , \; k = 1 \ldots m \}$, possesses the same
parabolic time constant $\gamma$, since its solution can be expressed as

$$\Phi_k = \Phi_k(0) \, e^{-\gamma t} , \quad k = 1 \ldots m \tag{20}$$

Note that the errors committed in each of the constraints also 
decay with the same single time constant. However, regardless 
of their physical time constants, the errors in the constraint 
conditions by the stabilized constraint equations, Eqs. (9) and 
(14), are forced to decrease at the same rate. Hence, the 
technique does not take advantage of physically different time 
constants in order to minimize the errors being accumulated 
in the constraint equations. 


Remark 2

Note that the generalized constraint forces $\lambda$ in Eq. (12)
exist only when the matrix $BM^{-1}B^{T}$ is not ill-conditioned.
Even though the constraints are theoretically independent,
such ill-conditioning can occur when two or more constraints
become numerically nearly dependent, as $B$ is in general
state-dependent. If such situations develop, the accuracy of
the generalized constraint force $\lambda$ can be considerably degraded,
thus leading to a dramatic loss of solution accuracy for $q$.

Remark 3

From computer implementation considerations, the solution
of MBD systems by Baumgarte’s technique must be carried
out in a tightly coupled program module. Therefore, any
change in the number of constraints impacts the matrix structure
of the solution procedures, requiring dynamically varying
matrix profiles. This can considerably complicate the task of
software implementation.

We will now present a new stabilization technique that
mitigates the three algorithmic and software implementation
difficulties in Baumgarte’s stabilization technique pointed out
in the preceding remarks.

IV. New Technique: Staggered
Stabilization Procedure

In Baumgarte’s stabilization technique, as discussed in the
preceding section, the objective was to minimize the errors
initiated in the constraint condition

$$\Phi = 0 \qquad (21)$$

First, the difficulty associated with numerically dependent
constraints alluded to in Remark 2 can be overcome by
adopting the penalty procedure

$$\lambda = \frac{1}{\epsilon}\Phi, \qquad \epsilon \to 0 \qquad (22)$$

as the basic constraint equations instead of Eqs. (3) and (4). It
is noted that the penalty procedure as given by Eq. (22) tacitly
assumes violations of the constraint condition in actual
computations. If one substitutes Eq. (22) into the governing
equations of motion, the result becomes

$$M\ddot{q} + \frac{1}{\epsilon}B^{T}\Phi = Q \qquad (23)$$

It can be shown that this penalty procedure mitigates
nonconvergence difficulties in the constraint conditions. However, its
major drawback is that once an error is committed in computing
$\lambda$, there is no compensation scheme by which the drifting
of the numerical solution can be corrected. It is this observation
that has led to the development of a staggered stabilization
procedure as described in the following paragraphs.

To illustrate the new procedure we will consider the case of
nonholonomic constraints. Instead of substituting the penalty
expression directly into the governing equations of motion,
first we differentiate Eq. (22) once to obtain

$$\dot{\lambda} = \frac{1}{\epsilon}\left(B\ddot{q} + \frac{\partial\Phi}{\partial t}\right) \qquad (24)$$

where we assume the penalty parameter $\epsilon$ to be constant.

Second, we obtain $\ddot{q}$ from Eq. (6a) in the form

$$\ddot{q} = M^{-1}(Q - B^{T}\lambda) \qquad (25)$$

and substitute it into Eq. (24) to yield

$$\epsilon\dot{\lambda} + BM^{-1}B^{T}\lambda = BM^{-1}Q + \frac{\partial\Phi}{\partial t} \qquad (26)$$


Notice that the homogeneous part of this stabilized equation
in terms of the generalized constraint forces $\lambda$ has the following
companion eigenvalue problem:

$$(\gamma + BM^{-1}B^{T}/\epsilon)\,y = 0 \qquad (27)$$

where $\{\gamma_k,\ k = 1 \ldots m\}$ are the eigenvalues of the homogeneous
operator for the new stabilized constraint equations,
Eqs. (26). Since $\gamma_k$ also dictates how the errors in the constraint
equations will diminish with time, the errors committed
in the constraint conditions will decay with their corresponding
different response time constants. This physically oriented
stabilization property of the present technique is in contrast to
that of Baumgarte’s technique wherein all the error components
diminish according to a single time constant.
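
The distinct decay rates implied by Eq. (27) can be illustrated with a small numerical sketch. The Python fragment below is only an illustration under stated assumptions, not the authors' implementation: it assumes a diagonal mass matrix and exactly two constraints (so the 2 x 2 eigenvalues have a closed form), and the values of B, M and the penalty parameter are hypothetical. It evaluates $\gamma_k = -\mu_k/\epsilon$, where $\mu_k$ are the eigenvalues of $BM^{-1}B^{T}$.

```python
import math


def constraint_decay_rates(B, m_diag, eps):
    """Decay exponents gamma_k of eps*lam_dot + (B M^-1 B^T) lam = 0.

    Assumes exactly two constraints (two rows in B) and a diagonal
    mass matrix given by its diagonal m_diag.
    """
    n = len(m_diag)
    # G = B M^{-1} B^T for a diagonal M
    G = [[sum(B[i][k] * B[j][k] / m_diag[k] for k in range(n))
          for j in range(2)] for i in range(2)]
    tr = G[0][0] + G[1][1]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    mu = (tr / 2.0 - disc, tr / 2.0 + disc)  # eigenvalues of the symmetric G
    return [-m / eps for m in mu]            # gamma_k from Eq. (27)


# Hypothetical data: two constraints acting on three coordinates
B = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
gammas = constraint_decay_rates(B, m_diag=[1.0, 2.0, 4.0], eps=1e-3)
print(gammas)
```

For these illustrative values the two constraint errors decay at different rates, whereas Eq. (20) assigns the single rate $\gamma$ to every constraint.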
Third, the new technique enables us to solve for $\lambda$ from the
stabilized differential equation, Eq. (26). Specifically, we now
have a set of coupled equations, one for the generalized
coordinates $q$ and the other for the generalized constraint
forces $\lambda$, which are recalled here from Eqs. (6a) and (26) for
the case of nonholonomic constraints:

$$\begin{bmatrix} M & 0 \\ 0 & \epsilon \end{bmatrix}
\begin{Bmatrix} \ddot{q} \\ \dot{\lambda} \end{Bmatrix} +
\begin{bmatrix} 0 & B^{T} \\ 0 & BM^{-1}B^{T} \end{bmatrix}
\begin{Bmatrix} q \\ \lambda \end{Bmatrix} =
\begin{Bmatrix} Q \\ BM^{-1}Q + \partial\Phi/\partial t \end{Bmatrix} \qquad (28)$$

Note that these coupled equations directly provide the desired
differential equations for a conjugate pair of $\lfloor q\ \ \lambda \rfloor$.
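
How the multiplier equation can be advanced on its own may be seen from a scalar stand-in for Eq. (26). The sketch below is a hedged illustration only: it replaces $BM^{-1}B^{T}$ by a hypothetical constant scalar `g`, takes the right-hand side `r` as constant, and uses the implicit midpoint rule as one possible implicit formula. All numerical values are assumptions chosen for illustration.

```python
def integrate_multiplier(eps, g, r, lam0, dt, steps):
    """Implicit midpoint rule for eps*lam_dot + g*lam = r (scalar, constant g, r)."""
    lam = lam0
    for _ in range(steps):
        # Midpoint value from (eps + dt/2 * g) * lam_mid = eps*lam + dt/2 * r
        lam_mid = (eps * lam + 0.5 * dt * r) / (eps + 0.5 * dt * g)
        # Extrapolate the midpoint value to the end of the step
        lam = 2.0 * lam_mid - lam
    return lam


# Step size ten times larger than eps: the implicit update remains stable
lam_end = integrate_multiplier(eps=1e-3, g=1.0, r=2.0, lam0=0.0, dt=0.01, steps=50)
print(lam_end)  # approaches the steady value r / g
```

The step size here is much larger than the penalty parameter, which is the regime in which an explicit update of the multiplier would be unstable.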

Remark 4 

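For holonomic constraints, one has several stabilization
possibilities. The one we have chosen is to integrate the
governing equations of motion once to obtain

$$\dot{q}^{n} = \delta M^{-1}(Q^{n} - B^{T}\lambda^{n}) + h_{\dot q}^{n} \qquad (29)$$

which is substituted into

$$\dot{\lambda} = \frac{1}{\epsilon}\left(B\dot{q} + \frac{\partial\Phi}{\partial t}\right) \qquad (30)$$

to yield

$$\epsilon\dot{\lambda}^{n} + \delta BM^{-1}B^{T}\lambda^{n} = B(\delta M^{-1}Q^{n} + h_{\dot q}^{n}) + \frac{\partial\Phi}{\partial t} \qquad (31)$$

Fig. 1 Control synthesis representation of two stabilization techniques:
a) Baumgarte’s technique, and b) stabilized technique.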

Remark 5 

It is observed that even if $BM^{-1}B^{T}$ is almost singular, the
new stabilization technique as derived in Eqs. (26) and (31)
would not cause numerical difficulty in computing $\lambda$ since the
solution iteration matrix becomes $(\epsilon + \delta BM^{-1}B^{T})$ for nonholonomic
cases and $(\epsilon + \delta^{2} BM^{-1}B^{T})$ for holonomic cases.
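
The effect of the penalty term on a nearly singular constraint matrix can be checked with a two-constraint toy case. The following sketch is an illustration under assumptions, not taken from the paper: the matrix B has two hypothetical rows that are numerically almost dependent, M is the identity, and the values of $\epsilon$ and $\delta$ are arbitrary. It compares the determinant of $BM^{-1}B^{T}$ with that of the nonholonomic iteration matrix $\epsilon I + \delta BM^{-1}B^{T}$ and solves a small system with the latter.

```python
def gram(B, m_diag):
    """Return B M^{-1} B^T for a diagonal mass matrix."""
    n = len(B)
    return [[sum(B[i][k] * B[j][k] / m_diag[k] for k in range(len(m_diag)))
             for j in range(n)] for i in range(n)]


def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


def solve2(A, b):
    """Cramer's rule for a 2 x 2 system."""
    d = det2(A)
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / d,
            (A[0][0] * b[1] - b[0] * A[1][0]) / d]


# Hypothetical constraint gradient with almost dependent rows
B = [[1.0, 0.0],
     [1.0, 1e-8]]
G = gram(B, [1.0, 1.0])
eps, delta = 1e-6, 0.005
A = [[eps + delta * G[0][0], delta * G[0][1]],
     [delta * G[1][0], eps + delta * G[1][1]]]

det_G = det2(G)   # numerically zero: G alone cannot be inverted reliably
det_A = det2(A)   # bounded away from zero by the penalty parameter
lam = solve2(A, [1.0, 1.0])
print(det_G, det_A, lam)
```

In this toy case the regularized matrix remains invertible even though the unregularized product is singular to working precision.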

Remark 6 

The present staggered stabilization technique and Baum- 
garte’s technique can be presented in control-synthesis block 
diagrams, as shown in Figs. 1a and 1b. For nonholonomic
constraints, the present technique can be viewed as a combi- 
nation of gain plus rate feedback stabilization, whereas 
Baumgarte’s technique is seen as a simple gain feedback 
stabilization. For holonomic constraints, a similar distinction 
can be observed. The resulting feature of a rate feedback 
manifested in the present staggered stabilization technique 
constitutes an important attribute as it copes with the dynami- 
cal nature of the problem. 

V. Numerical Evaluation 

The first problem is a one-bar rigid pendulum problem 
studied in Ref. 15. The equations of motion consist of both 
horizontal and vertical trajectories of the pendulum’s tip plus 
one constraint equation for the circular motion of the tip: 
thus, there are two position variables and one holonomic 
constraint condition. First, we fix the integration stepsize and 
carry out the numerical solution by the trapezoidal rule without 
iteration for both stabilization techniques. Figure 2 shows the 
errors in the constraint condition for the two techniques. The 
results show that the present technique yields accuracy about 
two orders of magnitude higher than that yielded by Baum- 
garte’s technique. In order to gain further insight, the accuracy 
level in the constraint condition is fixed to be the same ($10^{-6}$)
at each time step and the solution matrix is iterated to satisfy 
the accuracy requirement. Figure 3 illustrates the number of 
iterations needed at each step vs time. Note that the average 
iteration number for the present technique is about four, 
whereas with Baumgarte’s technique it is about six. 



Time (Step Size = 0.001)

Fig. 2 Errors in constraint with no iteration, performance of two 
stabilized techniques (single pendulum problem). 



Fig. 3 Number of iterations required for given error tolerance, perfor- 
mance of two stabilized techniques (single pendulum problem). 
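The second example is a classical crank mechanism whose
governing equations of motion are characterized by the following
matrices and constraints [see Eqs. (3-8) for their
definitions]:

$$M = \begin{bmatrix} J_1 & & & \\ & J_2 & & \\ & & m & \\ & & & m \end{bmatrix} \qquad (32)$$

$$\Phi = \begin{Bmatrix} r\cos\theta - (x - l_1\cos\phi) \\ r\sin\theta - (y - l_1\sin\phi) \\ (l - l_1)\sin\phi + y \end{Bmatrix} = 0 \qquad (33)$$

$$B^{T} = \begin{bmatrix} -r\sin\theta & r\cos\theta & 0 \\ -l_1\sin\phi & l_1\cos\phi & (l - l_1)\cos\phi \\ -1 & 0 & 0 \\ 0 & -1 & 1 \end{bmatrix} \qquad (34)$$

and

$$q = \lfloor \theta\ \ \phi\ \ x\ \ y \rfloor^{T}, \qquad \lambda = \lfloor \lambda_1\ \ \lambda_2\ \ \lambda_3 \rfloor^{T}, \qquad Q = \{0\ \ 0\ \ 0\ \ -mg\}^{T} \qquad (35)$$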
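
A quick consistency check between the constraint vector of Eq. (33) and the gradient matrix of Eq. (34) can be done numerically. The Python sketch below is an illustrative check only; the function names, the sample configuration and the dimensions `r`, `l1`, `l` are hypothetical choices, not values from the paper. It compares the analytic entries of the transposed gradient with central finite differences of the constraints with respect to $q = \lfloor\theta\ \phi\ x\ y\rfloor^{T}$.

```python
import math


def phi(q, r, l1, l):
    """Crank-mechanism constraints, following Eq. (33)."""
    theta, ph, x, y = q
    return [r * math.cos(theta) - (x - l1 * math.cos(ph)),
            r * math.sin(theta) - (y - l1 * math.sin(ph)),
            (l - l1) * math.sin(ph) + y]


def b_transpose(q, r, l1, l):
    """Analytic transposed gradient: rows follow q, columns follow the constraints."""
    theta, ph, x, y = q
    return [[-r * math.sin(theta), r * math.cos(theta), 0.0],
            [-l1 * math.sin(ph), l1 * math.cos(ph), (l - l1) * math.cos(ph)],
            [-1.0, 0.0, 0.0],
            [0.0, -1.0, 1.0]]


def numerical_b_transpose(q, r, l1, l, step=1e-6):
    """Central-difference approximation of the same matrix."""
    rows = []
    for i in range(4):
        qp, qm = list(q), list(q)
        qp[i] += step
        qm[i] -= step
        fp, fm = phi(qp, r, l1, l), phi(qm, r, l1, l)
        rows.append([(fp[k] - fm[k]) / (2.0 * step) for k in range(3)])
    return rows


q = [0.3, -0.2, 1.1, 0.4]      # hypothetical configuration
r, l1, l = 0.5, 0.8, 1.6       # hypothetical dimensions
BT = b_transpose(q, r, l1, l)
BT_num = numerical_b_transpose(q, r, l1, l)
max_diff = max(abs(BT[i][k] - BT_num[i][k]) for i in range(4) for k in range(3))
print(max_diff)
```

A small `max_diff` indicates that the matrix entries and the constraint expressions are mutually consistent for the sampled configuration.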


Figure 4 shows the problem definition along with the numerical
performance of the two procedures, Baumgarte’s technique
and the staggered stabilization technique. In
carrying out the computations, the trapezoidal rule has been
used to time-discretize the equations of motion [Eqs. (2)], the
constraints [Eqs. (3)], and their stabilized forms [Eqs. (19) and
(28)]. A sufficiently small step increment was used, corresponding
to 82 increments for one cycle of the mechanism,
with the time increment $h = 0.01$ for the period $T = 0.82$. In
order to measure the performance of the two techniques
directly, in terms of violation of the constraint conditions vs
time during one complete cycle, no iteration was performed at
each integration step. In each technique, the three constraint
conditions exhibited the same order of accuracy level. Hence,
we illustrate only one constraint violation history, i.e., the pin
joint constraint between the crank and the connecting rod.
Note that the error in the constraint condition for Baumgarte’s
technique remains about two digits above that with the
staggered stabilization technique. In addition, we have experimented
with several values of $\alpha$ and $\beta$ that are required in
Baumgarte’s technique, and the best parameter choice was
found to be $\alpha = \beta = 70$. For the staggered stabilization technique,
the penalty parameter $\epsilon$ chosen yielded an accuracy
level of about $10^{-5}$.

The third problem tested is a simplified version of the
seven-link manipulator deployment problem (Ref. 13). The three links
are initially folded and, for modeling simplicity, between the
two joints is a coil spring that resists a constant deploying
force at the tip of the third link. Also, the left-hand end of the
first link is fixed through the same coil spring to the wall.
These three coil springs are to be locked up once the links are
deployed straight. The deployment sequence of the manipulator
is illustrated in Fig. 5. The time-discretized difference
equations both for Baumgarte’s technique and the staggered
stabilization technique have been solved at each time increment
by a Newton-type iterative procedure to meet a specified
accuracy level. Hence, the performance of the two techniques
can be assessed by the average number of iterations taken per
time increment. This is presented in Fig. 6 for the accuracy of
$10^{-4}$. Notice that the staggered stabilization technique requires
on the average about 4.5 iterations per step, whereas
Baumgarte’s technique requires about 22 iterations per step.



Fig. 4 Errors in pin-joint constraint with no iteration, performance of 
two techniques. 




Horizontal Dimension 

Fig. 5 Deployment of three-link remote manipulator. 


Note that Baumgarte’s technique fails to converge for time
$t \geq 1.1$, as manifested in Fig. 6, because the rows in $B$ become
numerically dependent upon one another when the links are in
a straight configuration. This corroborates the theoretical
prediction of nonconvergence whenever the solution matrix
$BM^{-1}B^{T}$ for Baumgarte’s technique [see Eq. (12)] becomes
singular. On the other hand, the staggered stabilization technique
still converges within 30 iterations because it overcomes
this singularity difficulty, since $\lambda$ still exists, as can be seen
from Eqs. (26) and (31). Although not reported here, the same
relative performance has been observed for different accuracy
levels, i.e., for the accuracy of $10^{-5}$ and $10^{-6}$.

From the sample test problems, we conclude that the 
staggered stabilization technique yields both improved accu- 
racy over and greater computational robustness than the 
Baumgarte technique. In addition, the staggered stabilization 
technique offers software modularity in that the solution of 
the constraint force $\lambda$ can be carried out separately from that
of the generalized displacement q. The only data each solution 
module needs to exchange with the other is a set of vectors, 
plus a common module to generate the gradient matrix of the 
Time

Fig. 6 Performance of two stabilization techniques for three-link
manipulator (solution accuracy = $10^{-4}$).

constraints, B. However, one should be cautioned not to 
extrapolate blindly to complex problems the results of the 
present simple examples. Further judicious experiments are 
needed in applying the present staggered stabilization technique
to complex production-level problems before it can be
adopted for general applications in multibody dynamic simu- 
lations. 
An explicit-implicit staggered time-integration procedure is presented for the solution of multibody dynamical
equations involving large rotations and constraints. The algorithm adopts a two-stage modification of the
central difference algorithm for integrating the translational coordinates and the angular velocity vector, and the
midpoint implicit algorithm to solve the kinematical relation in terms of the Euler parameters for updating the
angular orientations. The Lagrange multipliers to enforce the system constraints are obtained by implicitly
integrating a parabolically regularized differential equation for the multipliers. The performance of the present
procedure has been evaluated by applying the procedure to solve several sample problems. The results indicate
that the procedure is robust in dealing with a variety of constraints and spatial kinematic motions; hence it is
recommended for applications to general multibody dynamics analyses.


I. Introduction 

COMPUTER simulation of multibody dynamical (MBD)
systems has enjoyed substantial progress during the past
several years. As a result, it is now almost routine to perform
realistic modeling and assessment of some practical problems
such as mechanical linkages and manipulations of robotic
arms (Ref. 7). Recently, a new need for the large-scale, real-time
simulation of flexible MBD systems is emerging primarily in 
support of deployment and construction of large space struc- 
tures in orbit. The development of an MBD simulation soft- 
ware system for space applications must meet several needs, 
which include a versatile data structure for implementation of 
candidate MBD topologies, an automatic derivation of the 
equations of motion, a streamlined incorporation of the sys- 
tem constraints, a robust and efficient direct-time integration 
package, a modular interface with active-control systems, and 
timely visualization of the simulation results. Of these, the 
present paper focuses on a robust and efficient time-integra- 
tion package with parallel/concurrent computers as its pri- 
mary computational environment. 

In general, there have been two types of direct-time integra- 
tion algorithms for the transient response analysis of dynami- 
cal systems: explicit and implicit algorithms. Currently, im- 
plicit algorithms appear to be favored by many MBD 
specialists when both the generalized coordinates and the Lagrange
multipliers are treated as the unknowns. In this case,
the corresponding formulations incorporate the system constraints
by the penalty augmentation through the Lagrange
multipliers. It is well known that the resulting Newton-like
solution matrix is stiff. This has led to implicit time discretization
of the constraint-augmented equations and simultaneous
solution of both the generalized coordinates and the Lagrange
multipliers (Refs. 6, 13, 15, 22, 25).

On the other hand, if the constraints are eliminated so as to 
reduce the number of unknowns, it is possible for one to 
employ either implicit or explicit algorithms. For this situation,
if the system topology is an open tree, one may invoke
either a geometric or an algebraic procedure to streamline the
resulting equations of motion. Geometric procedures rely on
the use of the incidence matrix (Ref. 26) and the body-array matrix (Ref. 11).
Some of the proposed algebraic procedures include singular
value decomposition (Ref. 24), the use of generalized speeds (Ref. 12), the
coordinate partitioning technique (Ref. 25), and the so-called order-N
procedure (Ref. 14).

In developing the present MBD solution procedure, we have 
been guided by the following considerations, which have led to 
the selection of an explicit algorithm. First, the algorithm 
must be robust; experience suggests that explicit algorithms 
remain robust provided computations are stable. Second, the 
algorithm should be easily interfaced with a constraint proces- 
sor as well as an active control synthesizer; the task of inter- 
facing a software module with other software modules be- 
comes easier if its data structure is simple, thus favoring an 
explicit algorithm. To this end, as the central difference inte- 
gration algorithm has been most widely used for the explicit 
transient analysis of structural dynamics problems, we have 
decided to adopt the central difference algorithm as our basic 
integration algorithm. The rest of the paper is organized as 
follows. 

In Sec. II, we introduce basic equations of motion for MBD 
systems. For computational efficiency, the translational coor- 
dinates are expressed in the fixed-inertial frame, whereas the 
rotational coordinates are expressed in the moving body-fixed 
frame in terms of the Euler parameters. Section III introduces 
the partitioning of the governing equations of motion into two 
groups: translational and rotational. Such partitioning paves 
the way for the efficient treatment of the rotational motions 
via the singularity-free Euler parameters, which treatment is a 
major feature of the present paper. 

Section IV introduces the standard form of the central 
difference method for updating both the translational and the 
angular velocities. Once the angular velocities are obtained, 
the angular orientations are updated via the midpoint implicit 
formula employing the Euler parameters; update of the trans- 
lational coordinates is achieved by the central difference 
method. It is shown that the standard form of the central-dif- 
ference method is not applicable to the MBD equations, due to 
the unavailability of the generalized velocity vector at the time 
step at which the acceleration vector is evaluated. To over- 
come this difficulty, a staggered form of the central-difference 
method is developed. 
To complete the description of the solution procedure for
constrained MBD systems in Sec. V, the staggered-stabilized
technique for the solution of the constraint forces as independent
variables is summarized from Park and Chiou (Refs. 16, 17). When
the two algorithms, namely, the two-stage explicit algorithm
for the generalized coordinates and the implicit, staggered
procedure for the constraint Lagrange multipliers, are
brought together in a staggered manner, they form an
explicit-implicit staggered procedure.

Numerical evaluations of the present algorithm are reported 
in Sec. VI. Finally, Sec. VIII discusses several computational 
aspects of the present procedure and summarizes the main 
contributions of the present paper. 

II. Equations of Motion for Multibody Systems 

The discrete equations of motion for flexible multibody
systems can be expressed as (Ref. 4)

$$M\ddot{u} + D(\dot{u}) + S(u) + B_N^{T}\lambda_N + B_H^{T}\lambda_H = f(t) \qquad (1)$$

$$\Phi_N(\dot{u}, u, t) = 0, \qquad \Phi_H(u, t) = 0 \qquad (2)$$

where $M$ is the mass matrix, $D(\cdot)$ the generalized velocity-dependent
force operator, $S(\cdot)$ the internal force operator due
to member flexibility, $B_N$ and $B_H$ the gradients of the nonholonomic
and holonomic constraints [Eq. (2)], $\lambda_N$ and $\lambda_H$ are the
corresponding constraint forces, $f(t)$ is the applied force, $u$ is
the generalized displacement vector, $(\dot{\ })$ denotes time differentiation,
and $(\ )^{T}$ designates the matrix transposition.

The numerical solution of the constrained dynamical system
governed by Eqs. (1) and (2) consists of two tasks: the satisfaction
of the constraint conditions [Eq. (2)] to obtain $\lambda$ and the
computation of the generalized coordinates $u$ from Eq. (1). A
staggered, stabilized computational procedure to obtain $\lambda_N$
and $\lambda_H$ by satisfying Eq. (2) was presented in Park and
Chiou (Refs. 16, 17) and is summarized in Sec. V. The major thrust of
the present paper is therefore devoted to the computation of
the generalized coordinates $u$.

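III. Partitioning of the Multibody Dynamical
Equations

A basic difficulty in direct integration of Eq. (1) is that $\dot{u}$ is
not directly integrable, except for some special kinematic
configurations, to yield angular orientations. This motivates us to
partition $\dot{u}$ into the translational velocity vector $\dot{d}$, which is
directly integrable, and the angular velocity vector $\omega$, which is
not, and to treat them by a partitioned solution procedure
(Refs. 5, 18-20), viz.,

$$\dot{u} = \lfloor \dot{d}^{T}\ \ \omega^{T} \rfloor^{T} \qquad (3)$$

The equations of motion [Eq. (1)] can be rearranged according
to the preceding partitioning:

$$\begin{bmatrix} M_d & 0 \\ 0 & M_\omega \end{bmatrix}
\begin{Bmatrix} \ddot{d} \\ \dot{\omega} \end{Bmatrix} +
\begin{Bmatrix} Q_d \\ Q_\omega \end{Bmatrix} =
\begin{Bmatrix} f_d \\ f_\omega \end{Bmatrix} \qquad (4)$$

where

$$\begin{Bmatrix} Q_d \\ Q_\omega \end{Bmatrix} =
\begin{Bmatrix} Q_d(\dot{d}, d, q, \lambda) \\ Q_\omega(\omega, d, q, \lambda) \end{Bmatrix} =
\begin{Bmatrix} D_d(\dot{d}) + S_d(d, q) + B_d^{T}\lambda \\ D_\omega(\omega) + S_\omega(d, q) + B_\omega^{T}\lambda \end{Bmatrix} \qquad (5)$$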

in which $q$ is the angular orientation parameter vector, and $B_d$
and $B_\omega$ are the partitions of the combined gradient matrices of
the constraint conditions (2) that are symbolically expressed as

$$B = B_N + B_H, \qquad \lambda = \lambda_N + \lambda_H \qquad (6)$$

To effect the integration of the rotational degrees of freedom,
we partition $\dot{\omega}$ further into

$$\dot{\omega} = \lfloor \dot{\omega}_1, \dot{\omega}_2, \ldots, \dot{\omega}_p \rfloor^{T} \qquad (7)$$

where $\dot{\omega}_j$ is a (3 x 1) angular acceleration vector for the $j$th
body,

(8)

IV. Staggered Explicit Method for Multibody 
Dynamical Equations 

One of the most popular explicit time integration formulas 
for the solution of the second-order dynamical equations is the 
central difference method, which can be implemented as 

$$\dot{u}^{n+1/2} = \dot{u}^{n-1/2} + h\ddot{u}^{n} \qquad (9a)$$

$$u^{n+1} = u^{n} + h\dot{u}^{n+1/2} \qquad (9b)$$

where the superscript n designates the discrete time station 
t = nh and h is the step increment. 
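
The half-step form of Eqs. (9a) and (9b) can be sketched in a few lines of Python. This is a minimal illustration for a scalar undamped oscillator, not the authors' code; the function name, the start-up estimate of the velocity at $t = -h/2$, and the test problem are assumptions introduced here.

```python
import math


def central_difference(accel, u0, v0, h, steps):
    """Advance u_ddot = accel(u) with the half-step central difference form."""
    # Start-up assumption: estimate the velocity at t = -h/2 from the initial data
    v_half = v0 - 0.5 * h * accel(u0)
    u = u0
    for _ in range(steps):
        v_half = v_half + h * accel(u)   # Eq. (9a): velocity at n + 1/2
        u = u + h * v_half               # Eq. (9b): displacement at n + 1
    return u


# Unit-frequency oscillator u_ddot = -u with u(0) = 1, u_dot(0) = 0
u_end = central_difference(lambda u: -u, 1.0, 0.0, 0.001, 1000)
print(u_end, math.cos(1.0))
```

The velocity lives at half stations and the displacement at integer stations, which is the property that later causes difficulty when the acceleration depends on the velocity at an integer station.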

It should be noted that the conventional form of the central 
difference method 

$$u^{n+1} = 2u^{n} - u^{n-1} + h^{2}\ddot{u}^{n} \qquad (10a)$$

$$\dot{u}^{n+1} = \dot{u}^{n} + (h/2)(\ddot{u}^{n+1} + \ddot{u}^{n}) \qquad (10b)$$

is not applicable to the MBD equations since $\omega$ cannot in
general be directly integrated to yield suitable angular orientations,
let alone unfavorable accuracy problems associated with
Eq. (10a) as succinctly discussed by Henrici (Ref. 8).

A. Integration of the Translational Coordinates 

Assuming that $\{\dot{d}^{n-1/2}, d^{n}, q^{n}, \lambda^{n}\}$ are given at the time
steps $t = (n - 1/2)h$ and $nh$, one can proceed to obtain, from
the partitioned equations of motion [Eq. (4)], the translational
velocity and coordinates as

$$\dot{d}^{n+1/2} = \dot{d}^{n-1/2} + hM_d^{-1}[f_d^{n} - Q_d(\dot{d}^{n}, d^{n}, q^{n}, \lambda^{n})] \qquad (11a)$$

$$d^{n+1} = d^{n} + h\dot{d}^{n+1/2} \qquad (11b)$$

Note that, due to the intrinsic time-stepping nature of Eqs. (9),
$\dot{d}^{n}$ that is needed in computing $Q_d^{n}$ is not available. This
difficulty can be overcome if $Q_d$ has the form

$$Q_d = D_d\dot{d} + S_d(d, q) + B_d^{T}(d, q)\lambda \qquad (12)$$

where $D_d$ is a constant diagonal matrix. For this special case,
one can employ the averaging operator

$$D_d\dot{d}^{n} = D_d\,(1/2)(\dot{d}^{n+1/2} + \dot{d}^{n-1/2}) \qquad (13)$$

so that Eq. (11a) is modified to

$$(M_d + \tfrac{1}{2}hD_d)\dot{d}^{n+1/2} = (M_d - \tfrac{1}{2}hD_d)\dot{d}^{n-1/2} + h\left(f_d^{n} - S_d(d^{n}, q^{n}) - B_d^{nT}\lambda^{n}\right) \qquad (14)$$

If, on the other hand, $D_d$ is not diagonal or $B_d$ contains $\dot{d}$,
the modification offered in Eq. (14) loses the advantage of the
central-difference method in that one must either factor the
matrix $(M_d + \tfrac{1}{2}hD_d)$ or iterate on $B_d$. This difficulty is more
pronounced for updating the angular velocity vector as discussed
next.
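
The averaging idea behind Eq. (14) can be tried on a scalar damped oscillator. The sketch below is a hedged illustration: the scalar coefficients `m`, `c`, `k` stand in for $M_d$, $D_d$ and a linear $S_d$, there is no constraint term, and the start-up estimate of the velocity at $t = -h/2$ is an assumption made here.

```python
import math


def damped_central_difference(m, c, k, f, u0, v0, h, steps):
    """Central difference with the velocity-dependent force averaged over half steps."""
    a0 = (f - c * v0 - k * u0) / m
    v_half = v0 - 0.5 * h * a0            # start-up assumption for v at t = -h/2
    u = u0
    for _ in range(steps):
        # Scalar analogue of Eq. (14):
        # (m + h c / 2) v^{n+1/2} = (m - h c / 2) v^{n-1/2} + h (f - k u^n)
        v_half = ((m - 0.5 * h * c) * v_half + h * (f - k * u)) / (m + 0.5 * h * c)
        u = u + h * v_half
    return u


u_end = damped_central_difference(1.0, 0.2, 1.0, 0.0, 1.0, 0.0, 0.001, 1000)

# Closed-form solution of u_ddot + 0.2 u_dot + u = 0, u(0) = 1, u_dot(0) = 0 at t = 1
zeta = 0.1
wd = math.sqrt(1.0 - zeta ** 2)
u_exact = math.exp(-zeta) * (math.cos(wd) + (zeta / wd) * math.sin(wd))
print(u_end, u_exact)
```

Because the damping coefficient is a scalar here, the left-hand factor is inverted by a simple division, which mirrors the diagonal case discussed in the text.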

B. Integration of the Angular Orientations 

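One can update the angular velocity vector by Eq. (9a) using
$\dot{\omega}$ from Eq. (4b) (recall from Eq. (6) that $B = B_N + B_H$ and
$\lambda = \lambda_N + \lambda_H$):

$$\omega^{n+1/2} = \omega^{n-1/2} + hM_\omega^{-1}[f_\omega^{n} - Q_\omega(\omega^{n}, q^{n}, d^{n}, \lambda^{n})] \qquad (15)$$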
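A key feature of the present algorithm is the use of the
following kinematic relation (e.g., Wittenburg, Ref. 26) to update the
angular orientations:

$$\dot{q} = \frac{1}{2}\begin{bmatrix} 0 & -\omega^{T} \\ \omega & -\tilde{\omega} \end{bmatrix} q
\quad\text{or}\quad \dot{q} = \frac{1}{2}A(\omega)\,q, \qquad
q = \lfloor q_0\ \ q_1\ \ q_2\ \ q_3 \rfloor^{T} \qquad (16)$$

that is subject to

$$q^{T}\cdot q = 1 \qquad (17)$$

where

$$\tilde{\omega} = \begin{bmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{bmatrix}, \qquad
\omega = \lfloor \omega_1\ \ \omega_2\ \ \omega_3 \rfloor^{T} \qquad (18)$$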


Of several procedures tested, we have found the following
midpoint implicit rule is the most robust and accurate:

$$q^{n+1/2} = q^{n} + (h/2)\dot{q}^{n+1/2} = q^{n} + (h/4)A(\omega^{n+1/2})\cdot q^{n+1/2} \qquad (19a)$$

$$q^{n+1} = 2q^{n+1/2} - q^{n}, \qquad (q^{n+1})^{T}\cdot q^{n+1} = 1 \qquad (19b)$$

where $q^{n+1/2}$ is obtained by

$$q^{n+1/2} = \frac{1}{\Delta}\left(I + (h/4)A(\omega^{n+1/2})\right)q^{n}, \qquad
\Delta = 1 + (h^{2}/16)(\omega_1^{2} + \omega_2^{2} + \omega_3^{2}) \qquad (20)$$

Finally, once $q^{n+1}$ is computed from Eq. (19), one can
update the angular orientation matrix $R$:

$$b = Re,$$

$$R = \begin{bmatrix}
2(q_0^{2} + q_1^{2}) - 1 & 2(q_1q_2 + q_0q_3) & 2(q_1q_3 - q_0q_2) \\
2(q_1q_2 - q_0q_3) & 2(q_0^{2} + q_2^{2}) - 1 & 2(q_2q_3 + q_0q_1) \\
2(q_1q_3 + q_0q_2) & 2(q_2q_3 - q_0q_1) & 2(q_0^{2} + q_3^{2}) - 1
\end{bmatrix} \qquad (21)$$

which relates the body-fixed basis vector, $b = \lfloor b_1\ \ b_2\ \ b_3 \rfloor^{T}$,
to the inertial-basis vector, $e = \lfloor e_1\ \ e_2\ \ e_3 \rfloor^{T}$.
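
The midpoint update of the Euler parameters and the orientation matrix of Eq. (21) can be exercised with a short script. The sketch below is an illustration under assumptions rather than the authors' implementation: it takes $\dot{q} = \tfrac{1}{2}A(\omega)q$ with the sign convention written in the code, uses $\Delta = 1 + (h^{2}/16)|\omega|^{2}$ (the value for which $(I - (h/4)A)^{-1} = (I + (h/4)A)/\Delta$ holds for a skew-symmetric $A$ with $A^{2} = -|\omega|^{2}I$), and spins a body about its third axis at a constant, hypothetical rate.

```python
import math


def a_matrix(w):
    """4 x 4 skew-symmetric matrix A(w) such that q_dot = 0.5 * A(w) * q."""
    w1, w2, w3 = w
    return [[0.0, -w1, -w2, -w3],
            [w1, 0.0, w3, -w2],
            [w2, -w3, 0.0, w1],
            [w3, w2, -w1, 0.0]]


def euler_parameter_step(q, w_mid, h):
    """Midpoint implicit update of the Euler parameters over one step h."""
    A = a_matrix(w_mid)
    delta = 1.0 + (h * h / 16.0) * sum(wi * wi for wi in w_mid)
    q_half = [(q[i] + 0.25 * h * sum(A[i][j] * q[j] for j in range(4))) / delta
              for i in range(4)]
    return [2.0 * q_half[i] - q[i] for i in range(4)]


def rotation_matrix(q):
    """Orientation matrix built from the Euler parameters, following Eq. (21)."""
    q0, q1, q2, q3 = q
    return [[2 * (q0 * q0 + q1 * q1) - 1, 2 * (q1 * q2 + q0 * q3), 2 * (q1 * q3 - q0 * q2)],
            [2 * (q1 * q2 - q0 * q3), 2 * (q0 * q0 + q2 * q2) - 1, 2 * (q2 * q3 + q0 * q1)],
            [2 * (q1 * q3 + q0 * q2), 2 * (q2 * q3 - q0 * q1), 2 * (q0 * q0 + q3 * q3) - 1]]


q = [1.0, 0.0, 0.0, 0.0]
h = 0.01
for _ in range(100):                      # constant spin of 1 rad/s about axis 3
    q = euler_parameter_step(q, (0.0, 0.0, 1.0), h)
norm = math.sqrt(sum(c * c for c in q))
R = rotation_matrix(q)
print(norm, q[0], R[0][0])
```

In this sketch the unit-norm condition is preserved by the update itself, so no renormalization step is applied.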

It should be remarked that the update of the angular orientation
parameters through the kinematical relation [Eq. (16)]
is in contrast to the conventional algorithm in which one
substitutes $\dot{\omega}$ in Eq. (4b) in terms of $q$ and $\dot{q}$ by

$$\dot{\omega} = 2T(q)\ddot{q} + 2\dot{T}(q)\dot{q}, \qquad
T(q) = \begin{bmatrix}
-q_1 & q_0 & q_3 & -q_2 \\
-q_2 & -q_3 & q_0 & q_1 \\
-q_3 & q_2 & -q_1 & q_0
\end{bmatrix} \qquad (22)$$

and integrates the resulting equations of motion to update $q$.

However, computations of $\omega^{n+1/2}$ by Eq. (15) assume that
$\omega^{n}$ is available for every integration step. Note the $D_\omega(\omega)$ in Eq.
(5) takes for each body the form of

$$D_\omega(\omega) = \tilde{\omega}J\omega \qquad (23)$$

where $J$ is the moment of inertia matrix. This term often
dominates the momentum exchange in multibody systems and
presents numerical difficulties if $\omega^{n}$ in $D_\omega$ is approximated by
$\omega^{n-1/2}$, leading to inaccurate solutions or numerical instabilities.


A linearized computational stability analysis for the algorithm
based on Eqs. (15) and (19), although we do not
report it here, has been performed when $D_\omega(\omega^{n})$ is approximated
by $D_\omega(\omega^{n-1/2})$. The analysis result, as corroborated in
Sec. VI, shows that such a naive approximation leads to
unacceptable accuracy loss or outright instability. This has
motivated us to implement both Eqs. (11) and (15) in a two-stage
time-stepping procedure as detailed next.

C. Staggered Integration of the Translational and Rotational Coor- 
dinates 

To alleviate the computational and stability issues encountered
in the single-stage implementation of the central-difference
method for MBD simulations, the basic algorithm presented
in the preceding section needs to be modified as
follows. Specifically, at an arbitrary integration step from
$t = nh$ to $t = (n + 1)h$, it is necessary for accuracy and stability
that $\dot{d}^{n}$ and $\omega^{n}$ are available for Eqs. (11) and (15), respectively.
Within the algorithmic context of the central difference
method, this can be accomplished if we stagger the integration
as follows.

First, instead of marching from the $(n + 1)$ to the $(n + 2)$
step at the completion of the $(n + 1)$ step, we go back one-half
step and march a full step from the $(n + 1/2)$ to $(n + 3/2)$
step:

$$\dot{d}^{n+1} = \dot{d}^{n} + h\ddot{d}(\dot{d}^{n+1/2}, d^{n+1/2}, q^{n+1/2}, \lambda^{n+1/2}) \qquad (24a)$$

$$d^{n+3/2} = d^{n+1/2} + h\dot{d}^{n+1} \qquad (24b)$$

for the update of the translational coordinates and

$$\omega^{n+1} = \omega^{n} + h\dot{\omega}(\omega^{n+1/2}, q^{n+1/2}, d^{n+1/2}, \lambda^{n+1/2}) \qquad (25a)$$

$$q^{n+1} = (1/\Delta)[I + (h/4)A(\omega^{n+1})]\cdot q^{n+1/2} \qquad (25b)$$

$$q^{n+3/2} = 2q^{n+1} - q^{n+1/2}, \qquad (q^{n+3/2})^{T}\cdot q^{n+3/2} = 1 \qquad (25c)$$

for the rotational coordinates.

For the next integration step, we march from the $(n + 1)$
step to the $(n + 2)$ step, and so on, hence the name “two-stage
staggered explicit procedure.” The net result is that, even
though we take a full step ($h$ instead of $h/2$), we only advance
half the step at a time. In other words, we evaluate the acceleration
and the angular acceleration vectors twice for each full
step.
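
The interleaving of integer and half stations can be sketched for the angular velocity alone. The Python fragment below is a minimal illustration, not the authors' implementation: it treats a single torque-free rigid body with a diagonal inertia matrix, so that the only force term is the gyroscopic one of Eq. (23); the half-step Euler start-up, the inertia values and the initial spin are assumptions made here.

```python
def omega_dot(w, J):
    """Torque-free Euler equations, J w_dot = -(w x J w), for diagonal J."""
    return [(J[1] - J[2]) * w[1] * w[2] / J[0],
            (J[2] - J[0]) * w[2] * w[0] / J[1],
            (J[0] - J[1]) * w[0] * w[1] / J[2]]


def energy(w, J):
    return 0.5 * sum(J[i] * w[i] * w[i] for i in range(3))


def two_stage_march(w0, J, h, full_steps):
    """Advance w at integer and half stations alternately, each with a full step h."""
    w_int = list(w0)
    wd = omega_dot(w_int, J)
    # Start-up assumption: a half-step explicit Euler estimate of w at n = 1/2
    w_half = [w_int[i] + 0.5 * h * wd[i] for i in range(3)]
    for _ in range(full_steps):
        wd = omega_dot(w_half, J)                              # acceleration at n + 1/2
        w_int = [w_int[i] + h * wd[i] for i in range(3)]       # n -> n + 1
        wd = omega_dot(w_int, J)                               # acceleration at n + 1
        w_half = [w_half[i] + h * wd[i] for i in range(3)]     # n + 1/2 -> n + 3/2
    return w_int


J = [1.0, 2.0, 3.0]            # hypothetical principal moments of inertia
w0 = [1.0, 0.1, 0.1]           # hypothetical initial angular velocity
w_end = two_stage_march(w0, J, h=0.001, full_steps=1000)
drift = abs(energy(w_end, J) - energy(w0, J)) / energy(w0, J)
print(w_end, drift)
```

Each pass through the loop evaluates the angular acceleration twice, once at a half station and once at an integer station, so the velocity needed by the gyroscopic term is always available at the station where the acceleration is formed.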


V. Implicit Solution of Constraint Forces 

The solution of the governing equations of motion is carried 
out by the two-stage explicit integration procedure presented 
in the preceding section. For systems with constraints, one 
must either eliminate the constraints or solve them as part of 
the system unknown. Many MBD systems involve constraints 
that are either difficult or computationally cumbersome to 
eliminate. For this reason, we will adopt the staggered stabilized
procedure (Refs. 16, 17), which is reviewed here for convenience.
First, instead of augmenting Eq. (2) to Eq. (9), and simultaneously
solving the generalized coordinates and the Lagrange
multipliers, we employ a partitioned solution procedure
to solve the generalized coordinates separately from the Lagrange
multipliers. To effect a partitioned solution of the
constraints, we introduce the following penalty expression

$$\lambda_N = (1/\epsilon)\Phi_N(\dot{u}, u, t), \qquad \lambda_H = (1/\epsilon)\Phi_H(u, t), \qquad 0 < \epsilon \ll 1 \qquad (26)$$

It is noted that the penalty procedure as given by Eq. (26)
tacitly assumes violations of the constraint condition [Eq. (2)]
in actual computations. Now, to solve $\lambda$ separately, it is necessary
to cast $\lambda$ in a differential form instead of in the algebraic
form. This is accomplished as follows.

Instead of substituting the penalty expression directly into 
the governing equations of motion [Eq. (1)1, first we differen- 
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tiate Eq. (35) once to obtain 

Xyv = (bnu n + (27a) 

X W = - f (bhu„ + (27b) 

where we assume the penalty parameter e to be constant. 

In practice, both the holonomic and the nonholonomic 
constraints may be associated with a common set of general* 
ized coordinates. For such cases, we time-differentiate the 
holonomic constraints and combine those sets of 4>^ into 4>/v in 
Eq. (26). In this way X* and \ H become uncoupled in Eq. (27) 
Let use rewrite Eq. (1) in the form 

ii/v) _ M N 0 
u H ) .0 M h 

where p is a generalized momenta 

p = D{ii) + S(u) (29) 

so that, upon substituting Eq. (28a) into Eq. (27a) for the 
nonholonomic case, one obtains 

, d<t> v 

t\ s + B.\Mm 'B£\ n = r kw = flvMC 1 (f y - p s ) + (30) 

For the holonomic case, we integrate u H once by the midpoint 
implicit formula [see e.g., (Eq. (19a)] to obtain 


\/n-P n -Bn*n 
[Jh - pH Bh^H. 


(28) 


been implemented in two separate integration modules: gener- 
alized coordinate integrator (CINT) and Lagrange multiplier 
solver (LINT). The CINT employs a two-stage modified form 
of the central-difference method for updating the angular 
velocity vector and the midpoint-implicit rule for updating the 
angular orientations via the Euler parameters. The Lagrange 
multiplier solver adopts a staggered form of the midpoint 
implicit method. It should be noted that CINT needs the 
constraint force vector, viz, f\ = B T \, as an applied force 
from LINT. Similarly, LINT needs the generalized coordi- 
nates and their time derivatives from CINT. Hence, the step 
advancing of the present procedure is accomplished in a stag- 
gered manner. 

The module LINT receives/" = B from LINT and ad- 
vances the solution of the MBD equation [(1) or (4)] from time 
t n to f" f '. Once (d, d» w, q) are available at time t n + 1 from 
CINT, LINT computes the Lagrange multipliers from Eq. 
(34). 

To complete the solution of both the generalized coordi- 
nates and the Lagrange multipliers, we invoke the following 
sequence calls: 

/ = r 

Call CINT {p\ g\ h>p” + ') 

Call LINT (P + 1/2 , h , X" + ,/2 , A" + wl ) 
t = t n + h/2 {n — n + 1/2) 

Call CINT {p n * l/2 , g n + 1/2 , h , p n + 3/2 ) 

Call LINT (P + l , h y X- + , ,/T +l ) 
t = t n + h 

where 


= u% + (h /2)ii%* 1/2 


= u 


n 

N 


+ '{fn* 1/2 -PIC 1/2 - Bj,\1C l/2 ) 

Substituting Eq. (31) into the Eq. (27b), we obtain 

fKC 1,2 + \ B„M,7 'BMC 1/2 = r >,< 


Bn 


u%+-M H -'{f^ x/2 )-p n H ' ,/2 ) 


dt 


Equations (30) and (32) can be written as 
e\ + BM = r K 


(31) 


(32) 


(33) 


pn = (d n ' ul , d\ w A, - l/2 , q n ) 

g * = w\n = Bi\«)\ 

+ 1/2 = + l/2 # rfn ♦ \/2, ♦ 1/2^ q n + \fl* X”) 

In summary, the present procedure requires two function 
evaluations and two $\lambda$-solutions per full step, hence the 
name “explicit-implicit staggered procedure.” We now present 
three sample problems whose efficient and accurate solutions 
confirm, taken together, not only the viability of the present 
integration procedure for the solution of the multibody equations 
of motion with or without constraints, but also that of the 
constraint stabilization procedure. 


Integration of the preceding equation by the midpoint implicit 
rule yields the following difference equation: 

$(\epsilon I + (h/4) B M^{-1} B^T) \lambda^{n+1/4} = (h/4)(r_\lambda^{n+1/2} + r_\lambda^n) + \epsilon \lambda^n$   (34a) 

$\lambda^{n+1/2} = 2\lambda^{n+1/4} - \lambda^n$   (34b) 

It has been shown that the preceding staggered, stabilized 
procedure for the solution of the constraints not only offers a 
modular software package to treat the constraints but also 
yields more robust solutions than the techniques proposed 
by Baumgarte (Refs. 2 and 3). In particular, even when $B M^{-1} B^T$ 
becomes nearly singular, the staggered stabilized procedure 
[Eq. (34)] gives stable and acceptable solutions, whereas the 
constraint forces computed by Baumgarte’s technique diverge. 

The present explicit-implicit, staggered procedure given by 
Eqs. (11), (15), and (19) together with the constraint solver 
[Eq. (9)] constitutes a complete solution procedure for a multi- 
body dynamics analysis for systems with constraints that un- 
dergo large motions. 

VI. Computer Implementation and 
Performance Evaluations 

The preceding procedure for the numerical integration of 
the equations of motion for constrained MBD systems has 


A. Dynamics of a Rolling Ball 

This problem was investigated by Huston et al. (Ref. 10); however, 
their equations do not involve the constraint force $\lambda$. In the 
present analysis, we employ a formulation that incorporates 
the constraint force as part of the system variables. Figure 1 
illustrates the ball, with radius $a$ and an offset center $r_0$, that 
is to follow a sine curve, 

$\Phi = y - \sin x = 0$   (35) 

Fig. 1 Solid spherical ball rolling on a flat surface. 



566 


PARK, CHIOU, AND DOWNER 


J. GUIDANCE 


Table 1 Physical dimensions and initial conditions for 
a rolling sphere 

$m = 71.32$ N, $a = 10.9$ cm, $r_0 = 0$ or $0.15$ cm 
$J_1 = J_2 = J_3 = (2/5) m a^2$, $\epsilon = 10^{-6}$ 

x°=y° = 0, = “ w? — 1 , 0J3 = 0 j 

*° = / = ao;°. [«/®= 1. 



Fig. 2 Ball track projected on three-dimensional sphere surface. 


The various matrices and vector quantities for Eqs. (26) 
and (35) can be derived as 


Af = 


m 0 
0 m 

sym. 


- mr&\ • b 2 mr&\ * b\ 0 

— mro^2 ’ ^2 mr oC2'b\ 0 

Ji ° 

h 


(36a) 


B = 


0 — ab\ ■ e 2 -ab 2 e 2 ~ab^e 2 


' 1 

0 1 ab } * ab 2 • e , ab } • e x 

cos* - 1 0 0 0 


((j)xU)y€\ • b\ + (jJ 2 03}€ 1 ' b 2 “ (W| + ‘ ^3 

Fd = “ mr ° [a>|OJ 3 <?2 • b\ + u) 2 u)^e 2 * b 2 - (cj? + u 2 )e 2 • b 


(36b) 


(37) 




0>2 W 3("^2 — /3) 
u)30>i(./j “ J\) 
(i3\(i) 2 {J l — ^ 2 ) 


/rf = 0, /„ = mgr 0 \ 


e^b 2 
-e y -b x \ 
0 


(38) 


d - i*,^ L w i x - L x » Xz X> J 

(39) 

where the inertial-basis vector $e$ and the corotational-basis 
vector $b$ are related according to 

b = Re (40) 


There is a total of eight variables to describe the equations 
of motion for the constrained ball. However, in adopting the 
present solution procedure, viz., Eqs. (1-3), we solve for nine 
variables, as we employ the four Euler parameters for angular 
orientations. 

Numerical solutions for a sphere rolling along a sinusoidal 
curve on a flat surface have been obtained with the data summarized 
in Table 1. 

The ball track that follows the constraint sinusoidal curve 
[Eq. (26)] is projected on the ball itself as shown in Fig. 2, with 
the corresponding angular velocities in Fig. 3. The time histories 
of the three constraint forces are shown in Fig. 4, where $\lambda_1$ 
and $\lambda_2$ correspond to the x and y components of the constraint 
forces that maintain the rolling-contact condition, and 
$\lambda_3$ corresponds to the constraint force that maintains the 
sinusoidal trajectory imposed by Eq. (26). Hence, the first two 
constraints are indicative of the skidding phenomenon, 
whereas the third corresponds to the steering force required to 
maneuver the ball. Notice that, although periodic, they exhibit 
highly nonlinear behavior. 

We have performed convergence studies with increasing 
step sizes; these indicate that the present two-stage staggered 
explicit procedure, viz., Eqs. (1-3), maintains both solution 
accuracy and stability for step sizes $h \le 0.15$. 

Figure 5 shows the angular velocities for a ball with an 
offset center ($r_0 = 0.15a$). Note that the angular velocities no 
longer exhibit periodic response, whereas for the no-offset 
case they are periodic (see Fig. 3). Likewise, the steering force 
causing the ball to follow the sinusoidal curve ($y = \sin x$) becomes 
highly nonlinear (see Fig. 6), although it remains nonlinearly 
periodic. The x- and y-direction contact forces, which maintain 
the rolling-contact condition between the ball and the 


Fig. 3 Angular velocities of the sphere with no offset. 



Fig. 4 Time histories of three constraint forces on the rolling sphere (time axis; step size = 0.01). 
Fig. 5 Angular velocities of the sphere with offset center (time axis; step size = 0.01). 



Fig. 6 Time histories of three constraint forces with offset center. 



Fig. 7 Convergence studies on present and conventional procedure. 


surface, although bounded, manifest extremely nonlinear be- 
havior. 

To corroborate the instability of a naive approximation of 
$\omega^n \approx \omega^{n-1/2}$ for the computation of $D^n$ in computing $\omega^{n+1/2}$, as 
alluded to in Sec. IV.B, the equations of motion for the rolling 
ball have been integrated by the following formula: 

u rt+ 1/2 * c/ 1 ' 1/2 - h\i~ '[/*” + l/2 , d n )\ n - W' ,/2 )] 

(41) 


Figure 7 shows $\omega_2$ vs time for the converged solution, the 
present two-stage, explicit-implicit, staggered procedure with 
step size $h = 0.2$, and the conventional procedure with step 
size $h = 0.2$. The solution obtained by the conventional 
procedure clearly diverges, thus confirming the instability 
of the conventional procedure. On the other hand, the present 
staggered procedure faithfully traces the converged solution. 

Finally, the solution accuracy vs the step size has been 
assessed for the offset center ball with different step sizes. 
Figure 8 represents the performance of the present procedure 
for different step sizes. Note that if one chooses the step size 
that corresponds to more than 15 samples per period, viz., 
h < 0.2, a reasonable engineering accuracy can be maintained. 
Although not reported herein, the problem was also solved by 
the trapezoidal rule. For h > 0.1, the computational overhead 
with the trapezoidal rule was an order of magnitude higher 
than by the present two-stage staggered explicit-implicit proce- 
dure without an accuracy improvement. Our experience with 
the example problem indicates that the present computational 
procedure for handling large rotational and translational mo- 
tions with constraints is robust and efficient. It is important to 
note that the present procedure accurately traces not only the 
angular motions but, more important, the constraint forces 
and the four Euler parameters (although these are not pre- 
sented here). 


Fig. 8 Accuracy comparison on angular velocity $\omega_1$ for three different step sizes (0.01, 0.2, and 0.4), plotted against time. 



Fig. 9 Double pendulum with spatial joints. 
Fig. 10a Trajectories of double pendulum on X-Z plane. 

Fig. 10b Trajectories of double pendulum on X-Y plane. 


B. Three-Dimensional Double Pendulum 

The second problem with which we have tested the present 
procedure is a spatially moving double pendulum, as shown in 
Fig. 9. The governing equations of motion become those of 
two separate rigid bars, except that they are connected by two 
spherical joints. From Fig. 9 we have the following quantities: 

<t>' = d l - Vi u/ x z ' = 0, i = l ,2 (42) 


M - diag j m 1 , m 2 t J 1 j 


B = 


/ 

/ 


ViV x 0 

- Viz 1 x - / 


Ft*) - 



0 

- Viz 2 x 


(43) 

(44) 


/'= ~ 


0 

0 

0 

“ *A) 

W3C0|(^J J\) 

— Jz) 


/ = 1, 2 (45) 


u‘ ~ j\ 

i, «)' d { = [x, y, z] T , oj' = [on. <j 2 , w 3 ] r 

(46) 


X = [X|, X 2 , X 3 , X 4 , X5, X*] r 

(47) 


In the preceding equations, $\tfrac{1}{2}z$ is the vectorial distance 
from the center of the bar to the spherical joint constraints, $m$ 
and $J$ are the three translational and rotatory-inertia matrices, 
$Z$ is the skew-symmetric matrix formed by the three components 
of $z$, $\times$ implies a vector cross multiplication, and the 
superscript designates the $i$th bar. 

The pendulum is originally positioned in a gravity field with 
initial horizontal angular velocities (oA l) = wl 2) = 1). Figure 10 
shows the spatial trajectories of the two mass centers as pro- 
jected on the horizontal surface and on the vertical plane. It is 
noted that the two trajectories form a similar pattern. The 
constraint forces and angular velocities, although not reported 
here, exhibit patterns that are analogous in their characteris- 
tics for the two joints and two mass centers, respectively. 

We have performed convergence studies by using different 
step sizes h. Numerical evaluations indicate, as with the 
rolling-ball problem, that when the step-size samples are more 
than 20/period, the present procedure yields both good accu- 
racy and stability. 


C. Closed Four-Bar Linkage 


The final problem is a simple closed four-bar linkage, composed 
of four individual bars connected with five spherical 
joints (see Fig. 11). The governing equations of motion for 
this problem are identical with those of the previous section, 
except that the gradient of the constraint equations B is given 
by 


B = 


Bl 

Bl 


B} 

B} 


0 


0 

Bl 

Bl Bt 

Bl 


(48) 


where 

«; = (/, '/a x j, = [/. - Vii x J 

B{ = [ - /, - Vi i x j (49) 



Fig. 11a Initial configuration of the closed four-bar linkage. 
Fig. 12a Angular velocities of the closed four-bar linkage (horizontal axis: time, 0 to 10). 

Fig. 12b Constraint forces of the closed four-bar linkage (horizontal axis: time, 0 to 10). 


The body-fixed coordinates and constraint conditions for 
this problem follow the same procedure as in the preceding 
pendulum problem. To trigger large rotational motions, 
two unit vertical forces are applied at the 
centers of mass of the first and fourth bars (see Fig. 11a). Figure 
11b indicates the motion of each bar for a run time of 8 s. Note 
that the trajectories of each joint can also be seen from the 
plot. Because of the symmetry of the geometry and the applied 
forces, one should expect corresponding symmetries between 
the angular velocities of the first bar and those of 
the fourth bar, and so on (see Fig. 12a). This is also the case 
with the constraint forces, as manifested in Fig. 12b. 

We investigated numerical solutions for different step sizes 
h . The results show that when step size h is less than 0.075, the 
procedure proposed here maintains stability with acceptable 
accuracy. 


VII. Discussion 

In this paper, we have presented a computational procedure 
for direct integration of the MBD equations with constraints. 
Because of its step-advancing nature, the procedure is labeled 
an explicit-implicit staggered algorithm: explicit for the generalized 
coordinate integrator (CINT) and implicit for the Lagrange multiplier 
solver that incorporates the constraints (LINT). The present generalized coordinate solver 
(CINT) carries out its task in a partitioned manner in which 
the translational motions are integrated separately from those 
of the rotational parameters. 

Numerical experiments reported herein and additional applications 
investigated so far indicate that the present procedure 
yields robust solutions if the step size gives more than 20 
samples per period of the apparent highest response frequency 
of a given multibody system (Ref. 21). Hence, the present 
procedure appears to have accomplished the following. 


Because of the modular implementation of the present 
MBD solution procedure, the task of interfacing the present 
MBD solution modules with additional capabilities such as 
active controller, observer, and other analysis and design soft- 
ware modules becomes relatively straightforward. Such soft- 
ware architecture is in contrast to most existing programming 
practice in which several analysis capabilities are embedded 
into a single monolithic program. 

For closed-loop multibody systems and/or problems with 
complex topology, in which it is impractical and inadvisable to 
eliminate the constraints, the present procedure facilitates a 
straightforward construction of the governing equations of 
motion with appropriate constraints. The generalized coordi- 
nates and the Lagrange multipliers can then be solved in a 
partitioned manner. 

The update of angular orientations is based on the Euler 
parameters by adopting the midpoint implicit formula. This 
avoids potential computational complications, as the angular 
orientation matrices remain singularity free. 

Application of the present procedure to flexible multibody 
systems is currently being carried out, and preliminary results 
are quite encouraging. We hope to report in the near future on 
results with flexible-body dynamics as well as on results with 
large-scale multibody problems. 

Finally, a preliminary stability analysis of the present proce- 
dure, although not reported here, has been conducted. The 
analysis results indicate that the procedure is stable provided 
the step size satisfies 

$h < 2/(\omega_1^2 + \omega_2^2 + \omega_3^2)^{1/2}$   (50) 

A separate article on the stability issue is presently under 
preparation; we plan to publish it in the near future. 
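As a small worked illustration of the step-size bound, the sketch below reads Eq. (50) as $h < 2/(\omega_1^2 + \omega_2^2 + \omega_3^2)^{1/2}$, with $\omega_1$, $\omega_2$, $\omega_3$ taken to be the angular velocity components; that reading and the function name are assumptions.

```python
import math

def max_stable_step(w1, w2, w3):
    """Upper bound on h from Eq. (50), read as h < 2 / sqrt(w1^2 + w2^2 + w3^2)."""
    magnitude = math.sqrt(w1 * w1 + w2 * w2 + w3 * w3)
    if magnitude == 0.0:
        return math.inf  # no rotation: the bound places no limit on h
    return 2.0 / magnitude

# Usage: two unit components and one zero component.
bound = max_stable_step(1.0, 1.0, 0.0)
```

For this input the bound is $2/\sqrt{2} \approx 1.414$, so it is the accuracy requirement on samples per period, rather than stability, that governs the step sizes used in the examples.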
Abstract. This paper reports on our experience in solving large-scale finite element 
transient problems on the Connection Machine. We begin with an overview 
of this massively parallel processor and emphasize the features which are most 
relevant to finite element computations. These include virtual processors, par- 
allel disk I/O and parallel scientific visualization capabilities. We introduce a 
distributed data structure and discuss a strategy for mapping thousands of pro- 
cessors onto a discretized structure. The combination of the parallel data struc- 
ture with the virtual processor mapping algorithm is shown to play a pivotal role 
in efficiently achieving massively parallel explicit computations on irregular and 
hybrid two- and three-dimensional finite element meshes. The finite element ker- 
nels written in C* have run with success to solve several examples of linear and 
nonlinear dynamic simulations of large problem sizes. From these example runs, 
we have been able to assess in detail their performance on the Connection Machine. 
We show that mesh irregularities induce an MIMD (Multiple Instruction 
Multiple Data) style of programming which negatively impacts the performance 
of this SIMD (Single Instruction Multiple Data) machine. Finally, we address 
some important theoretical and implementational issues that will materially ad- 
vance the application ranges of finite element computations on this highly parallel 
processor. 


I. INTRODUCTION 

Parallel computers are having a profound impact on computational mechanics. 
This is reflected by the continuously increasing number of publications on finite 
elements and parallel processing. Not only have some computational strategies 
been re-designed for implementation on commercially available multiprocessors, 
but also some innovative algorithms have been spurred by the advent of these new 
machines. However, many of the reported parallel finite element simulations have 
been on systems with a few processors. Examples of these systems are Intel’s iPSC 
with 32 processors (reported by Farhat and Wilson [1]), JPL/Caltech’s hypercube 
with 32 processors (Lyzenga, Raefsky and Hager [2], and Nour-Omid and Park 
[33]), Alliant’s FX8 model with 8 processors (Belytschko and Gilbersten [3], and 
Farhat and Crivelli [4]), and CRAY’s systems with up to 4 processors (Benten, 
Farhat and Jordan [5]). (For more complete lists of references on this topic see 
White and Abel [6] and Noor [7].) While great speed-ups were measured on these 
coarse to medium grain machines, Farhat [8] has shown that traditional vector 
supercomputers could not be outperformed in finite element simulations (except 
of course on systems which connect more than one vector superprocessor, such 
as the CRAY X-MP and CRAY-2 systems, each of which has 4). 

Recently, massively parallel machines have demonstrated their potential to 
be the fastest supercomputers, a trend that may accelerate in the future. While 
solving the shallow water equations, McBryan has reported that the Connection 
Machine (CM-2 in the sequel) with 65536 processors was three times faster than the 
four-processor CRAY X-MP [9]. Gustafson, Montry and Benner have developed 
highly parallel solutions for baffled surface wave equations, unstable fluid flow 
and beam strain analysis, and have reported performances on NCUBE’s 1024- 
processor hypercube which are close to those of vector supercomputers [10]. 

The objective of the present study has been: first, to evaluate the multipro- 
cessing features of the CM-2 that are relevant to finite element computations, 
second, to develop a suitable finite element data structure which exploits the 
system architecture, third, to implement a decomposition/mapping procedure 
that matches as far as possible the layout of the processors to the finite element 
meshes, and fourth, to assess those implications of finite element analysis on the 
CM-2 that should be considered in the design of future massively parallel processors. 
Hence, we focus primarily on implementational issues that are critical 
for the full exploration of the multiprocessing capabilities of the CM-2, and only 
secondarily on solution algorithms, as far as they impact the present study on 
implementational issues. 

The finite element equations of motion for structural systems can be ex- 
pressed as: 


$M \ddot{d} + F^{in}(d, \dot{d}) = F^{ex}$   (1) 

where $M$ denotes the positive definite lumped mass matrix, $F^{in}$ and $F^{ex}$ denote 
the internal and external force vectors, and $\ddot{d}$, $\dot{d}$ and $d$ denote respectively the 
acceleration, velocity and displacement vectors. In the linear case, the internal 
force vector becomes: 

$F^{in} = D \dot{d} + K d$   (2) 

where D and K are the damping and stiffness matrices respectively, which are 
positive semidefinite. In this work, any damping is assumed to be proportional 
to the mass and stiffness. 

The algorithmic nature of a candidate solution method for the structural 
dynamics equation (1) can significantly influence the software requirements, data 
communications and arithmetic efficiency. As our main focus is on implementational 
issues rather than algorithmic ones, we have decided on a simple explicit 
time integration procedure. Hence, we choose to integrate equation (1) with the 
fixed step explicit central difference algorithm because (a) it is inherently parallel, 
and (b) it has the largest undamped stability limit among second-order accurate 
explicit linear multistep algorithms, as has been demonstrated by Krieg [11] and 
Park [12]. In our context, it is expressed as: 

$\dot{d}^{n+1/2} = \dot{d}^{n-1/2} + h M^{-1} (F^{ex}(t^n) - F^{in}(d^n, \dot{d}^n))$ 
$d^{n+1} = d^n + h \dot{d}^{n+1/2}$   (3) 

where $h$ is the fixed time step and the superscript $n$ indicates the value at the 
discrete time $t^n$. 
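A minimal serial sketch of the fixed step central difference update is given below for a lumped mass system. It is an illustration under stated assumptions rather than the C* kernel used on the Connection Machine: the function and callback names are hypothetical, and the velocity passed to the internal force is the half-step-lagged value $\dot{d}^{n-1/2}$.

```python
import math

def central_difference(mass, f_ext, f_int, d0, v_half0, h, n_steps):
    """Integrate M*a + F_in(d, v) = F_ex(t) with the explicit central difference rule.

    mass    -- lumped (diagonal) mass entries, one per degree of freedom
    f_ext   -- hypothetical callback: t -> list of external forces
    f_int   -- hypothetical callback: (d, v) -> list of internal forces
    d0      -- displacements at t^0
    v_half0 -- velocities at t^(-1/2)
    """
    d, v = list(d0), list(v_half0)
    history = [list(d)]
    for n in range(n_steps):
        fe = f_ext(n * h)
        fi = f_int(d, v)  # velocity lagged by half a step (assumption)
        # Velocity update to t^(n+1/2); each entry is independent of the others.
        v = [vi + h * (fe_i - fi_i) / m
             for vi, fe_i, fi_i, m in zip(v, fe, fi, mass)]
        # Displacement update to t^(n+1).
        d = [di + h * vi for di, vi in zip(d, v)]
        history.append(list(d))
    return history

# Usage: undamped unit oscillator started so that d(t) = cos(t).
h = 0.01
history = central_difference(
    mass=[1.0],
    f_ext=lambda t: [0.0],
    f_int=lambda d, v: [1.0 * d[0]],
    d0=[1.0],
    v_half0=[math.sin(0.5 * h)],
    h=h,
    n_steps=628,
)
```

Each degree of freedom is updated independently of the others once the force vectors are known, which is the sense in which the algorithm is inherently parallel.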

The remainder of this paper deals with the massively parallel solution of (1) 
using (3), and is organized as follows. In Section II, we give an overview of the 
CM-2 hardware configuration and emphasize those features which are pertinent 
to finite element computations. In particular, we address issues that are related 
to the processor memory size, to the SIMD architecture, and to the fast inter- 
processor communication package, the NEWS grid. In Section III, we discuss 
the floating point arithmetic performance of the CM-2 and highlight its current 
dependence on the selected language compiler. Algebraic manipulations coded 
in *Lisp are shown to be three times as fast as when written in C*. A general 
purpose finite element distributed data structure is presented in Section IV. De- 
signed originally to handle massively parallel finite element explicit computations 
on irregular and hybrid meshes, this parallel data structure is also very efficient 
for parallel I/O manipulations and parallel graphic animation. Since the often- 
encountered mesh irregularities inhibit the use of the NEWS grid communication 
package, we discuss in Section V an alternative decomposition/mapping strategy. 
The decomposition technique is designed to minimize both the amount of communication 
between different chips and the amount of wire contention within a 
chip. The mapping algorithm attempts to reduce the distance that information 
must travel. Section VI summarizes the overall organization of the massively 
parallel transient simulation. In Section VII, our parallel data structure and pro- 
cessor mapping are applied to (3) for the solution of various large-scale transient 
problems. Measured performances are analyzed in detail. Mesh irregularities 
are shown to be the source of several factors which considerably slow down the 
machine. Finally, in Section VIII, we address some important theoretical and im- 
plementational issues that will materially advance the application ranges of finite 
element computations on the CM-2. In particular, we note that time integration 
numerical algorithms such as explicit finite differences and equation solvers such 
as the preconditioned conjugate gradient are implemented using the same paral- 
lel data structure and mapping algorithm which are presented in this paper. We 
compare the substructuring technique and the virtual processor approach, and 
comment on the implications of implicit algorithms for the effective use of the 
CM-2. 

II. THE CONNECTION MACHINE HARDWARE ARCHITECTURE 

Here we present an overview of the CM-2 system organization and discuss issues 
that are pertinent to massively parallel finite element computations. See Hillis 
[13] for an in-depth discussion on the rationale behind the CM-1 (a previous model 
of the Connection Machine), the Technical Summary of Thinking Machines Corporation 
[14] for further architectural information, and McBryan [9] for initial 
studies of scientific computations on the CM-1. For the sake of clarity, we summarize 
the architectural features before discussing their impact on finite element 
simulations. 


II.1. System Organization 

II.1.1 CM-2: The Parallel Processing Unit 

The CM-2 is a cube 1.5 meters on a side, made of up to eight subcubes (fig. 
1). Each subcube contains 512 chips and every chip includes 16 bit-serial processors 
which are connected by a switch. Each individual processor has 64 Kbits 
(8 Kbytes) of bit-addressable local memory and an arithmetic-logic unit (ALU) 
that can operate on variable-length operands. Every two chips may share an optional 
Weitek floating point accelerator chip. A fully configured CM-2 thus has 
4096 ($2^{12}$) chips, 2048 floating point accelerator chips, 65536 processors, and 512 
Mbytes of memory. The chips are arranged in a 12-dimensional hypercube. A 
chip $i$ is directly connected to 12 other chips $j$, with the binary representations of 
$i$ and $j$ differing by exactly 1 bit. 
The CM-2 system provides two forms of communication between the processors: 

• a general mechanism known as the router, which allows any processor to 
communicate with any other processor. Each CM-2 chip contains one router 
node $i$, which serves the 16 processors on the chip, numbered $16i$ through 
$16i + 15$. The router nodes on all the chips are wired together in a 12-dimensional 
boolean cube and together form the complete router network 
(fig. 2). For example, suppose that processor 117 (processor 5 on router 
node 7) has a message to send to processor 361 (processor 9 on router node 
22). Since $22 = 7 + 2^4 - 2^0$, router 7 forwards the message to router 6 
($6 = 7 - 2^0$), which forwards it to router 22 ($22 = 6 + 2^4$), which delivers the 
message to processor 361. 

• a more structured and somewhat faster communication mechanism called 
the NEWS grid. Each processor is wired to its four nearest neighbors in a 
two-dimensional rectangular grid (fig. 3). Communication on the NEWS 
grid is extremely fast and is recommended whenever possible. 
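The routing example above can be reproduced with a short sketch. It assumes that a message corrects the differing address bits one at a time, lowest bit first; the actual order chosen by the CM-2 router hardware is not stated here, and the function names are hypothetical.

```python
PROCESSORS_PER_CHIP = 16
HYPERCUBE_DIMENSIONS = 12

def router_of(processor):
    """Return (router node, local processor index) for a global processor number."""
    return divmod(processor, PROCESSORS_PER_CHIP)

def hypercube_route(src, dst, dims=HYPERCUBE_DIMENSIONS):
    """List the router nodes visited when differing bits are fixed lowest-first (assumption)."""
    path = [src]
    node = src
    for bit in range(dims):
        if ((node ^ dst) >> bit) & 1:
            node ^= 1 << bit  # hop to the neighbor that differs in this one bit
            path.append(node)
    return path

def neighbors(node, dims=HYPERCUBE_DIMENSIONS):
    """Router nodes whose binary address differs from node in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dims)]

# Usage: the message from processor 117 to processor 361.
src_router, src_local = router_of(117)
dst_router, dst_local = router_of(361)
path = hypercube_route(src_router, dst_router)
```

The number of hops equals the number of bits in which the two router addresses differ, so no route in the 12-dimensional cube needs more than 12 hops.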

An important practical feature of the CM-2 is the support for virtual processors. 
When the CM-2 is initialized for a run, the number of virtual processors 
(vp in the sequel) may be specified. If it exceeds the number of available physical 
processors, then the local memory of each processor is split up into a number of 
regions equal to the ratio between the number of vps and the number of physical 
processors. Automatically, for every Paris (PARallel Instruction Set) instruction, 
the processors are time-sliced among the regions. If a physical processor is simulating 
N vps, each Paris instruction is decoded by the sequencer (as explained 
below) only once for N executions. This results in an enhanced user performance. 
Also, the use of a vp ratio > 1 allows the pipelining of floating point operations in the 
Weitek chips, which provides an additional enhancement to machine performance. 

The CM-2 is an SIMD machine. All processors execute identical instructions, 
although individual processors may choose to ignore any instruction. Consequently, 
an instruction which involves a nested binary branch can see its execution time 
increased by a factor of two. The SIMD nature of the CM-2 has some disadvantages 
in finite element computations, as will be shown. 
FIG. 2. The Router Network 

FIG. 3. The NEWS Grid 

II.1.2 The Front End Computer 


The parallel processing unit described above is designed to operate under the 
programmed control of a front-end computer (FE in the sequel) which may be 
either a Symbolics 3600 Lisp Machine or a DEC VAX 8000 series computer. The 
FE provides the program development and execution environment. It transmits 
instructions and associated data to the CM-2. Instructions from Paris are not 
handled directly by the CM-2. After they are issued from the FE, they are 
processed by a sequencer which broadcasts them to the CM-2 in the form of low 
level operations. 

II.1.3 The Data Vault System 

I/O has traditionally been the Achilles heel of computers and supercomputers. 
Moreover, it is very well known that I/O manipulations can easily dominate the 
execution time of a finite element code. The CM-2 I/O system appears to offer 
hope for the solution of this problem. 

The Data Vault is the CM-2 mass storage system. Each Data Vault unit is 
associated with one eighth of a fully configured CM-2. It stores its data in an 
array of 39 individual disk drives. With this disk farming system, the concept of 
performing parallel I/O is carried through: instead of regarding a file as a serial 
stream of bits, the CM-2 file system regards it as many streams of bits, which are 
read or written in parallel, one stream per processor. When eight Data Vaults 
operate in parallel, they offer a combined data transfer rate of 320 Mbytes per 
second and hold up to 80 gigabytes of data. 

II.1.4 The Graphic Display System 

The CM-2 graphic display system known as the Frame Buffer also incorporates 
the concept of parallelism. It allows the user to visualize on a color monitor 
screen the data in the processors. The display can be updated as computations are 
performed. We have found this tool very useful, not only for real-time animations, 
but also for debugging purposes. 

The system organization of a CM-2 is summarized in figure 4. 

FIG. 4. System Organization of a CM-2 

II.2. Impact on Finite Element Computations 

It is well known that the solution algorithm (3) can be implemented using only 
element-level computations. Hence, if each vp of the CM-2 is mapped onto one 
finite element, equation (1) can be efficiently integrated in parallel. The rationale 
behind this processor-to-element assignment will be analyzed in Sections IV and 
VIII. Here, we discuss the direct impact of the CM-2 hardware on such a decision. 

The Local Memory and Element Level Computations 

Consider the 9-node curved shell element shown in figure 5. 



FIG. 5. A 9-Node Shell Element 


Three displacements and two rotations are attributed to each node, which 
amounts to a total of 45 degrees of freedom per element. Consequently, the 
symmetric part of the elemental stiffness matrix, $K^{(e)}$, contains 45*(45 + 1)/2 = 
1035 words. If double precision is used, the storage of $K^{(e)}$ amounts to 1035*64 
= 66240 bits, which exceeds the 65536 bits that are available on a single CM-2 
processor. On the other hand, if single precision is used, the storage of $K^{(e)}$ 
requires 33120 bits, so that 32416 bits are left for the storage of the vectors $d^{(e)}$, 
$\dot{d}^{(e)}$, the elemental lumped mass vector $M^{(e)}$, and the forces $F^{ex(e)}$ and $F^{in(e)}$. 
However, even in the latter case, only a vp ratio of 1 can be used. This limits the 
size of the finite element mesh to the maximum number of processors available on 
the CM-2 at hand. Also, it inhibits further performance enhancement as outlined 
in Section II.1. 
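The storage count for the 9-node shell element can be checked with a few lines of arithmetic. The sketch below only restates the numbers quoted in the text; the function name is hypothetical.

```python
BITS_PER_PROCESSOR = 64 * 1024  # 64 Kbits of local memory per CM-2 processor

def symmetric_stiffness_bits(dof, bits_per_word):
    """Bits needed to store the symmetric part of a dof-by-dof elemental stiffness matrix."""
    words = dof * (dof + 1) // 2
    return words * bits_per_word

dof = 9 * 5  # 9 nodes, each with three displacements and two rotations
double_bits = symmetric_stiffness_bits(dof, 64)
single_bits = symmetric_stiffness_bits(dof, 32)
left_over_single = BITS_PER_PROCESSOR - single_bits
fits_in_double = double_bits <= BITS_PER_PROCESSOR
```

With 45 degrees of freedom the double precision matrix overshoots the local memory by 704 bits, while single precision leaves slightly less than half of the memory for all remaining element data.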

Fortunately, in our case the above storage requirements can be considerably 
decreased. The nature of explicit computations is such that $F^{in(e)}$ can be 
directly computed from the displacements at $t^n$ and the stress-strain constitutive 
equation. As a result, the solution process defined in (3) involves only vector 
quantities which do not require a large amount of storage, so that vp ratios 
between 1 and 4 are possible. However, the reader should keep in mind that the 
current local memory size of a CM-2 processor may penalize sophisticated high 
order elements and implicit finite element algorithms in general. This restriction 
is not encountered on other commercially available hypercubes such as iPSC, 
NCUBE and AMETEK among others. 

The NEWS Grid and Finite Element Patches 

Consider the regular finite element mesh shown in figure 6. Except on the boundaries, 
each element is connected in the same pattern to exactly eight other elements. 
Consequently, during the explicit time integration algorithm, each processor 
communicates with its neighbors in the same manner. Interprocessor communication 
can be performed with a two-step mechanism (fig. 6). 

FIG. 6. A Two Step NEWS Mechanism on a Regular Mesh 
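One way to picture a two-step exchange on a regular mesh is sketched below in serial form. It assumes that the first step moves data east-west and the second step forwards the accumulated row data north-south, so that diagonal neighbors are reached without a dedicated diagonal link; the actual ordering used on the CM-2 is not specified here, and the function name is hypothetical.

```python
def two_step_gather(grid):
    """Collect, for every cell, its own value and those of its (up to) eight neighbors.

    Step 1 exchanges values along rows (east-west); step 2 exchanges the
    row results along columns (north-south). Boundary cells simply have
    fewer neighbors.
    """
    rows, cols = len(grid), len(grid[0])

    # Step 1: each cell gathers its west neighbor, itself, and its east neighbor.
    row_data = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            row_data[r][c] = [grid[r][c + dc] for dc in (-1, 0, 1)
                              if 0 <= c + dc < cols]

    # Step 2: each cell gathers the row data held by its north and south neighbors.
    gathered = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gathered[r][c] = [value
                              for dr in (-1, 0, 1) if 0 <= r + dr < rows
                              for value in row_data[r + dr][c]]
    return gathered

# Usage: a 3-by-3 patch of elements labeled 0..8.
patch = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
result = two_step_gather(patch)
```

In this sketch the interior element ends up holding the data of all eight surrounding elements after only two nearest-neighbor exchanges.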


However, the beauty of the finite element method resides in the fact that it solves 
models with irregular meshes. Typically, a finite element mesh consists of several 
patches which are connected together using irregular transition regions (fig. 7). 
For these often encountered cases, the NEWS grid becomes impractical. Rather, 
the router has to be utilized. In Section IV, we describe how a distributed data 
structure can guide the router during this process. 



FIG. 7. Transition Zones 


SIMD Hardware vs. MIMD Finite Element Computations 

Typical finite element meshes comprise more than one type of element. Con- 
sider the case where a discretized region is modeled with shell elements that are 
stiffened with beam elements. Clearly, the instructions associated with the shell 
elements differ from those associated with the beam elements. Consequently, the 
vps which are assigned to shell elements and the vps which are assigned to beam 
elements cannot execute their segments of code in parallel; for example, the beam 
processors have to execute first, then the shell processors. If $T_b$ and $T_s$ denote
the execution times associated with the instructions for a beam and a shell
element respectively, the total elapsed parallel time for a single instruction over
the set (beams + shells) on an SIMD multiprocessor is $T_b + T_s$. On an MIMD
multiprocessor, this elapsed parallel time is $\max\{T_b, T_s\}$. Similar situations arise
when, during the loading, some elements turn out to be materially nonlinear and
some remain linear. In this case, one should always compute the linear component
of the response (the elastic stiffness for example) before attempting to test the
yielding criterion. However, in spite of these disadvantages, SIMD programs can
still be attractive, because they tend to be easier to debug and rarely suffer from
the synchronization errors which are typical of MIMD codes.
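
The timing argument above can be illustrated with a short C sketch. The function names are ours and purely illustrative; the two functions only restate the serialized (SIMD) and concurrent (MIMD) elapsed times for a mesh that mixes two element types.

```c
#include <assert.h>

/* Elapsed time for one element-level instruction sequence over a mesh that
   mixes beam and shell elements. t_beam and t_shell are the per-element
   execution times T_b and T_s (any consistent unit). */

/* SIMD: the two element types are serialized, so the times add up. */
double simd_elapsed(double t_beam, double t_shell)
{
    assert(t_beam >= 0.0 && t_shell >= 0.0);
    return t_beam + t_shell;
}

/* MIMD: the two groups run concurrently, so the slower one dominates. */
double mimd_elapsed(double t_beam, double t_shell)
{
    assert(t_beam >= 0.0 && t_shell >= 0.0);
    return (t_beam > t_shell) ? t_beam : t_shell;
}
```

For hypothetical times $T_b = 1$ and $T_s = 3$ (not measurements), the SIMD estimate is 4 and the MIMD estimate is 3.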

Parallel I/O in Finite Element Computations 

At each time step, the computed displacements, velocities, accelerations as well 
as strains and stresses need to be stored on disks. This represents a significant 
amount of I/O traffic. It has been our experience that the CM_2 Data Vault 
system is efficient at reducing the corresponding elapsed time (see Section VII). 

Real-time Graphic Animations 

The massively parallel real-time animation of the mesh deformations is a direct
consequence of the availability of the Frame Buffer and the decision to assign
a vp to each finite element. At each time step, after the node displacements are
found, all of the vps concurrently draw the outline of their assigned elements on
the graphic screen. The result is a real-time finite element animation.


III. BENCHMARKING THE CM-2 

At the time of writing this paper, the CM_2 supports three high level lan- 
guages: C* (pronounced see-star), *Lisp (pronounced star-lisp), and CM-Lisp 
(pronounced see-m-lisp). The first two are extensions of C and Lisp respectively. 
Paris is, in effect, the assembly language of this parallel processor.

In this section we comment on the results of a set of timing experiments 
that were carried out on the CM-2 of the Center for Applied Parallel Processing 
(CAPP), at the University of Colorado, Boulder. Since only one eighth of a cube 
was available on this system, all results were obtained using 8192 processors. 
McBryan [9] has shown that all results demonstrated on subcubes of the CM_2
scale essentially linearly to the 65536 processor system. Consequently, throughout
this paper, megaflop rates are reported after they are linearly scaled to the full
configuration. These experiments provided us with:

• a reference performance for the evaluation of our approach to massively 
parallel finite element explicit computations. 

• the influence of the vp ratio and that of the high level language compiler
on attainable performances. At this point, we remind the reader that, if
an application requires an amount of local memory (per processor) $m_a$, the
highest vp ratio possible is equal to the closest power of two to the ratio
between the maximum amount of local memory available on the machine
(currently 8 Kbytes), and $m_a$.
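
The memory bound on the vp ratio can be sketched in C as follows. This is a minimal illustration, not code from this work: the function name is hypothetical, and "closest power of two" is read here as rounding down, since rounding up would ask for more local memory than is available.

```c
#include <assert.h>

/* Largest power of two that does not exceed max_mem_bytes / m_a_bytes.
   ASSUMPTION: "closest power of two" is interpreted as rounding down. */
unsigned max_vp_ratio(unsigned long max_mem_bytes, unsigned long m_a_bytes)
{
    assert(m_a_bytes > 0 && m_a_bytes <= max_mem_bytes);
    unsigned long ratio = max_mem_bytes / m_a_bytes;  /* integer part, >= 1 */
    unsigned vp = 1;
    while ((unsigned long)vp * 2 <= ratio)
        vp *= 2;
    return vp;
}
```

With 8 Kbytes taken as 8192 bytes of local memory, an application that needs 2048 bytes per virtual processor could run at a vp ratio of 4 under this reading, while one that needs 3000 bytes would be limited to 2.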

Table 1 reports the megaflop rates for some scientific computations on the
CM_2 at different vp ratios. All statements were written in C*. Each statement is
performed by each processor on its variables. All variables were declared parallel
(local) and float (single precision), except variable dp which was declared mono
(serial) float, and variable i which was declared mono integer. Timings were
measured using the cmtimer routines. Each addition or multiplication was
counted as one flop.

TABLE 1. Megaflop Rates Using C*

Parallel Processor = CM_2 - Language = C* - Variable = float

                                        Vp Ratio
Statement            1     2     4     8    16    32    64   128   256
y[i] += a*x[i]     740   808   848   850   880     -     -     -     -
y = y + a*x        569   654   699   728   743   761   778   791   800
z = x*y            409   485   535   569   579   585   600   610   623
dp += x*y          202   359   583   839  1075  1240  1348  1400  1500
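
For readers unfamiliar with the timed statements, the following serial C analogues show what each one computes per entry. They are plain C, not C*, and the function names are illustrative assumptions; on the CM_2 each loop iteration would instead be executed by a separate (virtual) processor.

```c
#include <assert.h>
#include <stddef.h>

/* y = y + a*x : one multiply and one add per entry (saxpy). */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* z = x*y : one multiply per entry, with the result stored back into z. */
void vector_multiply(size_t n, const float *x, const float *y, float *z)
{
    assert(x != NULL && y != NULL && z != NULL);
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}

/* dp += x*y : one multiply per entry followed by a reduction into a single
   scalar; nothing is stored per entry. */
float dot_product(size_t n, const float *x, const float *y)
{
    assert(x != NULL && y != NULL);
    float dp = 0.0f;
    for (size_t i = 0; i < n; i++)
        dp += x[i] * y[i];
    return dp;
}
```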


Based on these results, we have observed the following: 

1. Floating point performance is enhanced at higher vp ratios. This is due to
the fact that for vp ratios greater than one, computations in the Weitek chip
are pipelined.

2. Vector saxpys are not slower than scalar ones. This is because memory
addresses are computed on the front end. The additional speed noticed for
vector saxpys is thought to be due to the overlapping of addressing and
floating point computations.

3. C* appears to handle poly (parallel) assignments poorly. This can be seen by
comparing the performances of the dot product and the vector multiply. Each
of these two vector operations requires one floating point operation per
processor. In addition, the dot product requires a reduction (accumulation
phase) which necessitates communication. However, at high vp ratios, the dot
product is twice as fast as the vector multiply! (At low vp ratios, the amount
of floating point computations is not large enough to amortize the price of
communication.) Since the dot product does not store any value in the
processor memory and the vector multiply stores the result of x*y back into
z, this leads us to believe that the C* compiler generates code which is
very inefficient at handling assignments. This also explains why the saxpy
exhibits a higher megaflop rate than the vector multiply: it has twice as
many floating point computations for one assignment.


The same computations were repeated using *Lisp. The comparison of both
sets of timings at the maximum vp ratio demonstrates a formidable superiority of
the *Lisp compiler (see fig. 8). This is partly due to the fact that it has been used
longer on the CM_2 than C*. In spite of the proven superior efficiency of *Lisp
over C*, we have chosen to implement our finite element code using C* because
of our familiarity with C.



FIG. 8. A Comparison of *Lisp and C* Performances 
IV. FINITE ELEMENT PARALLEL DATA STRUCTURES

Consider again the explicit central difference algorithm:

$$\dot{d}^{n+1/2} = \dot{d}^{n-1/2} + h M^{-1} \left( F^{ex}(t^n) - F^{in}(d^n, \dot{d}^n) \right)$$
$$d^{n+1} = d^n + h \, \dot{d}^{n+1/2} \qquad (4)$$


The global mass matrix $M$ is assembled once. At each time step $t^n$, the
computations are dominated by the evaluation of the internal forces:

$$F^{in} = \sum_e \int_{\Omega^{(e)}} (L S)^T \sigma \, d\Omega$$

where $\sigma$ is the stress vector, $S$ are the shape functions, $L$ is a partial derivative
operator and $\Omega^{(e)}$ is the area of the e-th finite element. Clearly, the parallel
computation of $F^{in}$ is best done element-by-element. Thus, equation (1) can be
efficiently integrated in parallel if the CM_2 virtual processors are mapped onto
the elements of the mesh. This is a departure from the grid point massively
parallel computations advocated by Thinking Machines Corporation for the CM_2
[14]. First, all processors compute concurrently the local forces $F^{ex(e)}(t^n)$ and
$F^{in(e)}(d^n, \dot{d}^n)$. Next, these contributions are accumulated through
communications among processors that are mapped onto neighboring elements.
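
To make the update in (4) concrete, here is a minimal serial C sketch of one time step, written after the forces have been assembled. It assumes a lumped (diagonal) mass matrix stored as one value per degree of freedom; the function and argument names are illustrative, not those of our C* code.

```c
#include <assert.h>
#include <stddef.h>

/* One explicit central difference step with a lumped (diagonal) mass matrix:
     v^{n+1/2} = v^{n-1/2} + h * (f_ex - f_in) / m
     d^{n+1}   = d^n + h * v^{n+1/2}
   f_ex and f_in are the assembled external and internal forces at t^n. */
void central_difference_step(size_t ndof, double h,
                             const double *mass,  /* lumped mass per dof */
                             const double *f_ex, const double *f_in,
                             double *v_half,  /* in: v^{n-1/2}, out: v^{n+1/2} */
                             double *d)       /* in: d^n,       out: d^{n+1}   */
{
    assert(h > 0.0);
    for (size_t i = 0; i < ndof; i++) {
        assert(mass[i] > 0.0);
        v_half[i] += h * (f_ex[i] - f_in[i]) / mass[i];
        d[i] += h * v_half[i];
    }
}
```

Only vectors appear in this update, which is why the explicit scheme needs so little storage per processor.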


In this section, we describe the finite element data structures which we have
selected to drive the massively parallel computations on the CM_2. These are
element oriented, while similar data structures proposed for other hypercubes
are subdomain oriented (see Farhat, Wilson and Powell [15] and Fox et al. [16]).
In Section VIII, we give further comments on this difference. We group these
data structures into two sets.


The first set of data structures deals with element-level parallel computa- 
tions. To be able to perform locally its assigned element-level computations 
— that is, to perform these computations without interacting with the front-end 
machine — each processor must store in its own memory its element type (truss, 
beam, shell, ..., number of Gauss points, ...), its element material properties 
(density, parameters and coefficients for constitutive equations, damping charac- 
teristics, thickness, ...), its nodal geometry (nodal coordinates, number of nodes 
per element), and its boundary conditions (fixed/free degrees of freedom at each 
node, prescribed forces at each node). This information is compacted in
one-dimensional arrays. In addition, each processor must also store in its memory
a set of scalars corresponding to computational parameters such as the fixed
time step $h$, and a scalar or one-dimensional buffer for the temporary storage of
messages to be passed to neighboring processors.

The second set of data structures provides the router with the mechanism
for parallel interprocessor communication. The inability of the NEWS grid to
handle irregular communication patterns has been addressed in Section II.2. Let $p$
denote a virtual processor and $e_p$ its assigned finite element. In order to exchange
$F^{ex(e_p)}$ and $F^{in(e_p)}$, virtual processor $p$ must be able to identify at run time:

• the set of processors mapped onto elements adjacent to $e_p$

• the nodes that $e_p$ shares with these elements

• at each shared node, the degrees of freedom which need to be assembled.
This particular information is vital for meshes with different types of
elements. It guarantees that, for example, a moment is not accumulated with
a force, or that a force in the x direction is not accumulated with a force in
the y direction.

If the above information is gathered in a global form on the front-end
machine, most of the execution time which elapses during the accumulation phase
would be due to message-passing between the CM_2 processors and the front-end
computer. On the other hand, if this information is decentralized - that is, if the
memory of processor $p$ is loaded only with the subset of that information which
is relevant to the connectivity of $e_p$ - the accumulation phase can be performed
without any message-passing between the CM_2 and the front-end computer.
Consequently, prior to any computation, the memory of processor $p$ is loaded
with the following one-dimensional arrays:


Proc_att_to_node   For each node connected to $e_p$, it contains the identification of
                   the processors that are mapped onto elements which are also
                   connected to this node. These are stored in a stacked fashion.

Pointer            This is a pointer array. It stores in position i the location in
                   Proc_att_to_node of the list of vps that are attached to the node
                   in the i-th local position.

Location           For each entry in Proc_att_to_node, this array specifies the local
                   position of the shared node in the processor that is mapped
                   onto an element adjacent to $e_p$.


The above arrays are set up by the dedicated finite element mesh analyzer 
which was presented by Farhat, Wilson and Powell [15]. They require about 
80 integer words per processor. Clearly, this is a very small overhead. The
mechanism of these arrays is depicted in figure 9 for element 1. The mesh patch 
is composed of shell and beam elements. 


Element 1

Proc_att_to_node   [2, 3, 3, 2]
Pointer            [1, 2, 2, 3, 5]
Location           [1, 2, 1, 2]


FIG. 9. A Distributed Data Structure
for Interprocessor Communication
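
The way the three arrays work together can be sketched in C using the values shown for element 1 in figure 9. The values are as transcribed from the figure; the sketch assumes that element 1 has four local nodes (the pointer array has five entries) and pads index 0 so that the 1-based numbering of the text can be kept. The helper names are hypothetical.

```c
#include <assert.h>

/* Arrays of figure 9 for element 1 (1-based values; index 0 is padding). */
static const int proc_att_to_node[] = {0, 2, 3, 3, 2};
static const int pointer[]          = {0, 1, 2, 2, 3, 5};
static const int location[]         = {0, 1, 2, 1, 2};

/* Number of neighboring processors that share local node `node`
   (1 <= node <= my_nodes). */
int shared_count(int node, int my_nodes)
{
    assert(node >= 1 && node <= my_nodes);
    return pointer[node + 1] - pointer[node];
}

/* k-th neighbor (k = 0, 1, ...) sharing local node `node`. The local
   position of that node inside the neighbor is returned in *remote_pos. */
int neighbor_of(int node, int k, int my_nodes, int *remote_pos)
{
    assert(k >= 0 && k < shared_count(node, my_nodes));
    int position = pointer[node] + k;
    *remote_pos = location[position];
    return proc_att_to_node[position];
}
```

Under this reading, local node 2 of element 1 is shared with no other processor, while local node 4 is shared with processors 3 and 2, where it occupies local positions 1 and 2 respectively.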
There is however one penalty associated with assigning one element to each
vp. The nodes which are common to several elements are duplicated in their 
corresponding processors. As a result, about 11% of the total memory available 
on the CM_2 is wasted. This is a small price for the highly parallel computations 
that are achieved. Given the low cost of memory nowadays, this seems a worth- 
while trade. Moreover, this assignment allows I/O manipulations and graphic 
post-processing to be trivially parallelized. At each time step, after the nodal 
displacements are found, all of the processors draw concurrently the outline of 
their assigned elements on the frame buffer and send back the results to the front 
end in parallel. 


V. THE DECOMPOSITION/MAPPING STRATEGY

Since the mesh irregularities inhibit the exploitation of the NEWS grid, we rely on
the data structures of Section IV to guide the router during interprocessor com- 
munication. However, there is still one additional problem to resolve. Efficiency 
in massively parallel computations requires the minimization of both the dis- 
tance that information must travel and, more importantly, the “hammering” on 
the router. In the case of finite element computations, this implies that adjacent 
elements must be assigned, as much as possible, to directly connected processors, 
and contention for the wire connecting neighboring chips must be reduced. This 
defines the mapping problem - that is, it defines which hardware processor is to 
be mapped onto which finite element of a given mesh. 

Farhat [19] developed a heuristic algorithm for mapping massively parallel 
processors onto finite element graphs and presented some analytical results for 
corresponding efficiency improvement. Basically, the algorithm searches itera- 
tively for a better mapping candidate through a two-step procedure for the mini- 
mization of the communication costs associated with a specific parallel processor 
topology. Because it seeks a very fast solution for a machine with thousands of 
processors, this algorithm does not guarantee “the” optimal mapping. However, 
it has produced very encouraging results on a variety of non-uniform two and 
three-dimensional meshes. 

In this work, we adapt the mapping algorithm of [19] to our target parallel 
processor, the CM_2. The 65536 processors of this machine are packaged into 
4096 16-processor chips, each having its own router node. The 4096 router nodes 
are arranged in a hypercube of dimension 12. To cope with this topology, we 
proceed in two steps. First, we decompose the given mesh into 4096 submeshes, 
each containing 16 connected finite elements. Next, we apply the mapper given
in [19] to identify which hardware chip is to be mapped onto which submesh. 
Finally, within each submesh, the elements are numbered randomly between the 
chip number and the chip number + 15. 
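
The notion of distance between chips can be made concrete with a short C sketch. In a binary hypercube, the number of hops between two router nodes is the Hamming distance between their addresses. The sketch assumes that consecutive processor numbers share a chip, so that the chip address is the processor number divided by 16; both function names are illustrative.

```c
#include <assert.h>

#define PROCS_PER_CHIP 16      /* from the text: 16 processors per chip */
#define CUBE_DIMENSION 12      /* from the text: 4096 router nodes      */

/* ASSUMPTION: consecutive processor numbers share a chip, so the chip
   (router node) address is the processor number divided by 16. */
unsigned chip_of(unsigned processor)
{
    assert(processor < (1u << CUBE_DIMENSION) * PROCS_PER_CHIP);
    return processor / PROCS_PER_CHIP;
}

/* Hop count between two router nodes of a binary hypercube: the number of
   address bits in which they differ (Hamming distance). */
unsigned hypercube_distance(unsigned chip_a, unsigned chip_b)
{
    unsigned diff = chip_a ^ chip_b;
    unsigned hops = 0;
    while (diff != 0) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}
```

Two elements mapped onto the same chip are at distance 0, and no pair of chips in the 12-dimensional cube is more than 12 hops apart.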

Given a finite element mesh, there are several ways to decompose it into
16-element submeshes (see for example Farhat [17] and Malone [18]). Here, each
submesh is to be assigned to one chip of the CM_2. In figures 10, 11 and 12, we
show two different decompositions for a discretized square domain, $D_1$ and $D_2$.



FIG. 10. Domain to be Decomposed 
Both decompositions yield 16 submeshes, each with 16 adjacent elements.
Decomposition $D_1$ was designed to minimize the communication bandwidth -
that is, the maximum number of different chips with which any chip needs to
communicate. It can be seen (fig. 13) that for $D_1$ the bandwidth equals 2, while
for $D_2$ it equals 8.



FIG. 13-a. Interchip Communication Pattern for D1



FIG. 13-b. Interchip Communication Pattern for D2 





It should be remarked that, if the substructuring approach [15, 16] had been
chosen - that is, assigning a subdomain to a physical processor - $D_1$ would have
been more efficient than $D_2$. For this decomposition, each chip would buffer the
contributions of its interface nodes and send only two messages, one to the chip
at its left and another to the chip at its right. The decomposition $D_2$ requires the
same chip to send up to 8 buffered messages. These messages would eventually
be shorter, but would still render $D_2$ more expensive because of message
start-up costs. However, we have opted for a virtual processor approach - that is,
assigning one element to a virtual processor - for reasons that are given in Section
VIII. For this case, processors exchange information one node at a time, so that
the number of interface nodes associated with a decomposition is more important
than its bandwidth. The reader can confirm that decomposition $D_1$ delivers 255
interface nodes, while $D_2$ delivers only 93. Indeed, there is another equally, if not
more important, reason why $D_2$ is better for the CM_2 than $D_1$. In the case of
$D_1$, all of the 16 processors of any chip communicate simultaneously with a set
of processors which are on the same neighboring chip (fig. 14). This generates a
significant amount of contention for the single wire that connects these two chips.
In the case of $D_2$ however, one can observe (fig. 15) that:

• for each chip, only 12 out of the 16 processors communicate with processors
on another chip

• only 3 processors out of these 12 communicate simultaneously with the
same neighboring chip, so that much less contention occurs for the wire
connecting the two chips. We recall that each chip is connected with up to
12 other ones using 12 different wires which can operate in parallel.



FIG. 14. Wire Contention Induced by Decomposition D1



FIG. 15. Wire Traffic for Decomposition D2


The decomposition $D_1$ was obtained using a general purpose finite element
decomposer presented by the first author in reference [17]. We advocate its use in 
conjunction with the mapper given in reference [19] for massively parallel compu- 
tations on the CM_2. The efficiency improvement potential of this preprocessing 
phase is demonstrated with the following finite element wave propagation prob- 
lem. Figure 16 shows the discretization of a tapered cantilever beam. The beam 
is modeled with 4-node isoparametric elements and linearly elastic plane stress 
constitutive equations. It is fixed at one end and subjected at the tip of the other 
to an impact point loading. The wave propagation nature of the problem dictates 
the meshing technique to create elements which are, as far as possible, of equal 
size. Since the beam is tapered, transition zones with irregular elements had to 
be introduced. Other mesh irregularities are due to the presence of a region with 
a hole. The complete mesh contains 8192 elements, which corresponds to an 8K 
CM_2. The use of a naive mapping (element i into processor i - 1) would have
resulted in a maximum routing distance between adjacent elements equal to 9.
Our decomposer/mapper reduces this distance to 5. If EFF denotes the efficiency
(speed-up per processor) of the parallel computations using a naive mapping, and
$f$ is the factor by which the decomposer/mapper reduces the maximum routing
distance between adjacent elements, the theoretical improved efficiency (Farhat,
[19]) is given by:


EFF* = [expression in terms of EFF and $f$; not recoverable here]        (5)


For this problem, we have measured an efficiency EFF = 40% on an 8K CM_2.
Since $f$ = 9/5, the predicted improved efficiency is EFF* = 54%. A second run
of the problem using the decomposer/mapper has revealed a measured improved
efficiency EFF* = 60%. The discrepancy between the predicted and measured
improved efficiencies is due to the fact that (5) does not account for the wire
contention problem.
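
As a plausibility check on the numbers above, suppose that the efficiency has the form $T_{cp} / (T_{cp} + T_{cm})$ and that the mapper divides only the communication time by $f$. This model is an assumption made here for illustration and may differ from equation (5) of [19]; it leads to $EFF^* = f \, EFF / (1 + (f - 1) \, EFF)$, which can be sketched in C as:

```c
#include <assert.h>

/* Efficiency after the communication time is divided by f, under the
   ASSUMED model EFF = T_cp / (T_cp + T_cm) with T_cp unchanged:
       EFF* = f * EFF / (1 + (f - 1) * EFF)
   This is an illustrative model, not necessarily equation (5). */
double improved_efficiency(double eff, double f)
{
    assert(eff > 0.0 && eff <= 1.0 && f >= 1.0);
    return f * eff / (1.0 + (f - 1.0) * eff);
}
```

For EFF = 40% and $f$ = 9/5 this model gives about 54.5%, in line with the quoted prediction of 54%.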




FIG. 16. Discretization and Decomposition of a Tapered Beam 
VI. FLOWCHART OF THE MASSIVELY PARALLEL TRANSIENT SIMULATION

The overall organization of the solution on the CM_2 of a transient dynamic 
problem using the explicit central difference algorithm is depicted in figure 17. 
It consists of four phases, namely: mesh preprocessing, data loading, number 
crunching, and data unloading. 


Read Input File (Front End)
Decompose Mesh and Form Parallel Data Structure (Front End)
Load Parallel Data Structure (Front End - CM_2)
Compute Lumped Mass Matrix (CM_2)
Compute Critical Time Step (CM_2)
Loop on Time Steps (Front End)
{
    Compute Internal and External Local Forces (CM_2)
    Assemble Global Forces (Interprocessor Communication)
    Compute Velocities, Displacements, Strains and Stresses (CM_2)
    Visualize Results (CM_2 - Frame Buffer)
    Archive Results (CM_2 - Data Vault)
}


FIG. 17. Solution of a Transient Problem on the CM_2 
A conservative stable time step for the central difference algorithm is given
by

$$h \le \frac{2}{\omega_{max}^{(e)}} \qquad (6)$$

where $\omega_{max}^{(e)}$ is the maximum element frequency of the undamped dynamic
problem. Belytschko has pointed out that it is in fact usually not practical to compute
the maximum eigenvalues of the element directly, for this would increase the cost
of computation considerably [20]. Instead, formulas for upper bounds on $\omega_{max}^{(e)}$
have been recommended. However, on massively parallel processors such as the
CM_2, the parallelism inherent in the computation of $\omega_{max}^{(e)}$ is such that this
frequency is obtained at the cost of the frequency of one single element.
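
The selection of the time step from (6) can be sketched in serial C as follows. On the CM_2 the maximum over the elements would be a parallel reduction across the virtual processors; here it is an ordinary loop. The function name and the safety factor argument are illustrative assumptions (a factor of 1.0 returns the bound itself).

```c
#include <assert.h>
#include <stddef.h>

/* Stable time step h = safety * 2 / max_e(omega[e]), where omega[e] is the
   maximum frequency of element e. The safety factor (0 < safety <= 1) is
   an illustrative parameter, not part of equation (6). */
double stable_time_step(size_t n_elem, const double *omega, double safety)
{
    assert(n_elem > 0 && safety > 0.0 && safety <= 1.0);
    double omega_max = omega[0];
    for (size_t e = 1; e < n_elem; e++)
        if (omega[e] > omega_max)
            omega_max = omega[e];
    assert(omega_max > 0.0);
    return safety * 2.0 / omega_max;
}
```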

The interprocessor communication mechanism for a mesh with more than 
one type of element is illustrated in figure 18. For the example shown, the 4-node 
elements are activated first. They communicate in four steps, one node at a time. 
Next, the 4-node elements are de-activated and the truss elements are selected. 
These communicate in two steps. As explained in Section II.2, the serialization 
between different types of elements is due to the SIMD nature of the CM_2. 







FIG. 18. Interprocessor Communication For a Hybrid Patch 
VII. EXAMPLES


In this section, we apply our approach to massively parallel finite element explicit 
computations to the solution of various transient problems on an 8K CM_2 with 
Weitek accelerators. We analyze performance results in detail. We assess the effi- 
ciency of our decomposition/mapping strategy at reducing communication time. 
We highlight the impact on machine performance of variations in mesh topol- 
ogy, finite element modeling, and problem nonlinearities. We also report on the 
performance of the Data Vault system for problems that are I/O bound. 

For each example, two simulations were carried out. The first one assumed 
a linear elastic material. In the second simulation, the material was assumed to 
have an elastoplastic behavior governed by a von Mises yield condition.

VII.1 E1: Transient Response of a Cracked Aluminum Plate

The quarter of a mesh in figure 19 was generated to study the dynamic response 
of a cracked aluminum plate under a uniform time varying loading. The full 
mesh contained a total of 4008 plane stress elements and 4073 nodes. Mesh 
irregularities were induced by transition zones. The NEWS grid could not be 
used. 



FIG. 19. A Quarter of a Mesh for a Cracked Plate 
VII.2 E2: Wave Propagation in a Three-Dimensional Bar

The second example considered was the impact of a metallic ball on an unsup- 
ported glassy bar. The bar was discretized using 8160 brick elements (fig. 20). 
The finite element mesh contained 13500 nodes and 40500 degrees of freedom. 
Given the regularity of the discretization, the NEWS grid was used for
interprocessor communication. This example was also re-run using the router for
comparison.


FIG. 20. Finite Element Discretization of a Glassy Bar 


VII.3 E3: Shuttle Docking Induced Vibrations in a Space Station

This dynamic analysis was carried out to investigate the vibrations of a space 
station model assembled from 5-meter erectable struts. These vibrations were 
assumed to be induced by a shuttle docking. The finite element model (fig. 
21) comprised 7584 three-dimensional truss elements and 2304 nodes. It was 
generated by aligning identical cells along various axes. However, each cell by 
itself was irregular (fig. 22) and did not allow the use of the NEWS grid. 
VII.4 E4: Three-Dimensional Glassy Bar on an Elastic Foundation

The wave propagation example problem E2 was repeated with different boundary 
conditions. The glassy bar was assumed to be supported by a layer of foam. The 
mesh was comprised of a total of 8164 elements (which is very close to the number 
of elements in the former mesh), of which 1636 truss elements were used to model 
an elastic foundation. 

VII.5 Performance Results and Analysis

All segments of code were written exclusively in C*. Floating-point arithmetic 
was performed in single precision (32 bit words). Measured performance results 
are gathered in tables 2, 3, 4, 5 and 6. Only example E2 could make use of the 
NEWS grid. However, all timings except those given in table 6 correspond to 
runs where communication was carried through the router. Execution times are 
given in seconds and correspond to a sample of 2000 time integration steps and 
a vp ratio equal to 1. 

TABLE 2. Overall Measured Performance
for Various Transient Finite Element Computations

Example               Mesh            Data Loading    Equation of       Sustained
                      Preprocessing   in the CM_2     Motion Solving    MFLOPS
E1 - elastic          1.04 secs        5.47 secs       861 secs         400
E1 - elastoplastic    1.04 secs        5.47 secs      1033 secs         480
E2 - elastic          1.98 secs       31.78 secs      4139 secs         392
E2 - elastoplastic    1.98 secs       31.78 secs      4718 secs         440
E3 - elastic          1.28 secs       13.56 secs       887 secs         254
E3 - elastoplastic    1.28 secs       13.56 secs       896 secs         256
E4 - elastic          2.11 secs       33.00 secs      4770 secs         340
E4 - elastoplastic    2.11 secs       33.00 secs      5440 secs         386


The mesh preprocessing phase corresponds to the decomposition of the finite
element mesh as explained in Section V. It also includes the setup of the finite
element parallel data structure, which is then distributed across the processors.
Both of these phases are shown to require relatively little computer time. It
can also be observed that in the worst case, the nonlinear computations consume
only about 15% additional time. This is due to the explicit nature of the radial
return mapping algorithm that was used. Because of “what you see is what you
get”, the reported mflop rates should be compared to those measured in Section
III and not to the theoretical peak performance of the machine. It should also be
noted that our C* code still leaves room for further optimizations.

TABLE 3. Data Vault System Performance

Example   Solving Equation   Unloading Results   Unloading Results
          of Motion          on Front End        on Data Vault
E1         861 secs           5340 secs           3.81 secs
E2        4139 secs          16400 secs          12.61 secs
E3         887 secs           9500 secs           7.04 secs


For examples E1, E2, and E3, the computed displacements, strains and
stresses were archived on secondary storage after each time integration step. Two
solutions were compared. In the first case, these results were brought back to the
front end and stored in appropriate disk files. For that case, the measurements
given in table 3 demonstrate that the amount of involved I/O dominated the
simulation total time. In the second case, the results were transferred in parallel
directly to a Data Vault System. The speed-up provided by the Data Vault is
shown to be of the order of 1400! This parallel I/O capability is what was most
lacking on earlier hypercubes [18].
TABLE 4. Computation vs. Communication

Example   Solving Equation   Computation   Communication
          of Motion          Time          Time
E1         861 secs           460 secs      401 secs
E2        4139 secs          1959 secs     2180 secs
E3         887 secs           260 secs      627 secs
E4        4770 secs          2340 secs     2430 secs


If $T_{cp}$ and $T_{cm}$ are respectively the computation parallel time and the
communication parallel time, and $N_p$ is the number of available processors on a
given parallel machine, the achieved efficiency (speed-up per processor) can be
expressed as:

$$EFF = \frac{1}{N_p} \, \frac{N_p \, T_{cp}}{T_{cp} + T_{cm}}$$


The results given in table 4 indicate that efficiencies of 53%, 47%, 29% and 49%
are achieved respectively for examples E1, E2, E3 and E4. If one refers to the
performance results of Section III, it can be seen that the sustained mflop rates
reported in table 2 are consistent with these efficiencies. At first glance, these
efficiency results appear to be very pessimistic. However, they are well above
the 10% often obtained on current vector supercomputers [21]. The reader can
observe that the timing results for example E4 are very close to the cumulative
timings of examples E2 and E3, which illustrates the impact of the SIMD nature
of the CM_2 on the MIMD nature of finite element computations. It should
also be noted that while the communication time is fixed for a given mesh, the
computation time increases with the complexity of the analysis. Thus, highly
nonlinear formulations which include large deformations are expected to yield
higher efficiencies than those deduced from table 4.
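
The efficiencies quoted above follow directly from the timings of table 4, since the factor $N_p$ cancels in the expression for EFF. A minimal C sketch, with an illustrative function name:

```c
#include <assert.h>

/* EFF = (1/N_p) * N_p*T_cp / (T_cp + T_cm) = T_cp / (T_cp + T_cm).
   t_cp and t_cm are the computation and communication parallel times. */
double efficiency(double t_cp, double t_cm)
{
    assert(t_cp > 0.0 && t_cm >= 0.0);
    return t_cp / (t_cp + t_cm);
}
```

With the table 4 timings, efficiency(460, 401) is about 0.53 for E1, efficiency(1959, 2180) about 0.47 for E2, efficiency(260, 627) about 0.29 for E3, and efficiency(2340, 2430) about 0.49 for E4.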
At this point, we give further details regarding interprocessor communication
in the context of finite element explicit computations. As outlined in Section
V, the finite elements of a mesh exchange their local contributions one node
at a time. For a given finite element, this information exchange procedure is
organized around two nested loops. The outer loop is carried over the nodes that
are connected to this element. The inner loop is carried over the neighboring
elements that are attached to each local node. Using a C notation, this is written
as:

    for (node = 1; node <= my_nodes; node++) {                    /* (7) */
        start = pointer[node]; stop = pointer[node + 1] - 1;
        for (position = start; position <= stop; position++) {   /* (8) */
            neighbor = proc_att_to_node[position];
            exchange(variable, myself, neighbor);
        }
    }

where my_nodes is the total number of nodes that are connected to a given
finite element and proc_att_to_node is the array containing the identification
of the neighboring elements. Clearly, these variables are element dependent.
The total number of communications to be performed by one processor is
determined by the product P_c = d * (pointer[my_nodes + 1] - 1), which is both
element and mesh dependent. The CM-2 being an SIMD machine, the communication
time is determined by max_e {P_c}, the maximum of P_c over all elements e.
For a regular mesh composed of three-dimensional truss elements (d = 3) or
4-node plane elements (d = 2), every node is attached to 4 elements, so that
24 communication instructions per time integration step are required for the
truss element and 32 for the 4-node plane element. However, table 4 indicates
that the space station example exhibits a longer communication time than the
aluminum plate problem. The reason is that in the mesh of example E3, some
truss elements are connected to 12 other elements. Because of the SIMD nature
of the CM-2, the element with the highest degree of connectivity determines
the communication time. For a regular mesh with 8-node solid elements (d = 3),
each time integration step is followed by 192 communication steps, since each
node can be attached to up to eight different elements. This is reflected in
table 4, where example E2 is shown to possess by far the longest communication
time (2180 secs). In summary, the amount of communication involved in finite
element explicit computations on the CM-2 is determined by the element topology
and order, and by the mesh irregularities. Because only d nodal values are
exchanged at a time among the CM-2 processors, three-dimensional and high order
elements substantially increase the communication time. Mesh irregularities also
adversely affect the amount of communication because of the SIMD nature of the
CM-2. It is interesting to note that elements which transmit physical information
across edges and faces, such as those proposed by De Veubeke, [22], would require
much less communication than traditional elements. These elements should be
revisited for computations on massively parallel processors such as the CM-2.
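
The communication count discussed above can be illustrated with a small C
sketch. It is not the authors' C* code: the struct, the 0-based indexing and
the function names comm_count and simd_comm_count are assumptions made for
illustration only. Each element stores offsets into a compressed neighbor
list, and the SIMD cost is taken as the largest count over all elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element adjacency in compressed form (0-based here,
 * whereas loops (7) and (8) in the text are 1-based).
 * pointer has my_nodes + 1 entries; the neighbors attached to local node k
 * occupy positions pointer[k] .. pointer[k + 1] - 1 of a neighbor array. */
typedef struct {
    int my_nodes;        /* number of nodes of this element */
    const int *pointer;  /* offsets into the neighbor list  */
} ElementAdjacency;

/* Number of exchanges one element performs per time step: d values are
 * sent to every neighbor listed for every local node. */
int comm_count(const ElementAdjacency *e, int d)
{
    assert(e != NULL && e->my_nodes > 0 && d > 0);
    int entries = e->pointer[e->my_nodes] - e->pointer[0];
    return d * entries;
}

/* On an SIMD machine all processors step through the longest loop, so the
 * cost is governed by the element with the largest count. */
int simd_comm_count(const ElementAdjacency *elems, size_t n_elems, int d)
{
    int worst = 0;
    for (size_t i = 0; i < n_elems; i++) {
        int c = comm_count(&elems[i], d);
        if (c > worst)
            worst = c;
    }
    return worst;
}
```

For a two-node truss element whose nodes each list 4 attached elements,
comm_count with d = 3 gives 24, which matches the figure quoted in the text;
a single element with a more densely connected node raises the value returned
by simd_comm_count for the whole mesh.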

An in-depth investigation of the communication phase was carried out. It
was found that most of the communication time was spent in the header of loop
(8). This loop header involves the quantities start and stop, which differ from
one processor to another in the presence of mesh irregularities and different
element types. Consequently, the front end computer has to process and manage
several different loops rather than a single one, which is not very efficient on
an SIMD machine. The time associated with the headers of loops (7) and (8) is
referred to as software overhead in table 5. The time actually spent in effective
communication among the processors is shown to be only a fraction of the overall
communication time (see table 5).

TABLE 5. True Communication Time

Example   Computation   Effective             Software
          Time          Communication Time    Overhead

E1        460 secs      81 secs               320 secs
E2        1959 secs     1380 secs             1280 secs
E3        260 secs      146 secs              481 secs


Because it was designed to handle arbitrary meshes, our C* code did not
make use of the NEWS grid package. However, a special module that incorporated
calls to the NEWS grid was written specifically for the regular mesh of example
E2. Execution times for this example using both the NEWS grid and the router
are shown in table 6. Clearly, a high price is paid for the handling of possible
mesh irregularities.
However, the irregular pattern of communication is fixed in time. Thus, a
considerable improvement can be achieved if this pattern is evaluated at the first
time step, then somehow stored in the CM-2 for use during subsequent time steps.
We believe that this is an issue that massively parallel computer architects should
investigate.


TABLE 6. Router vs. NEWS Grid

Example   Computation   Communication Time     Communication Time
          Time          Using the NEWS Grid    Using the Router

E2        4139 secs     560 secs               2660 secs


In order to assess the performance of the decomposer/mapper module, examples
E1, E2 and E3 were re-run with the naive shifted identity mapping (element
i in processor i - 1). Figure 23 demonstrates that the true communication time
can be reduced by as much as 60%. Unfortunately, the total execution time
is reduced by only 10% to 17% because of the communication software
overhead associated with mesh irregularities.

FIG. 23. The Decomposer/Mapper Performance


VIII. CONCLUDING REMARKS

We have reported herein on our experience in performing transient finite element
computations on the CM-2. We have presented the architectural features of this
parallel processor and discussed their impact on finite element computational
strategies. In particular, those features which distinguish the CM-2 from earlier
hypercubes have been emphasized. These include the virtual processor concept
and the fast parallel I/O capabilities. The processor memory size of 64 Kbits
has been shown to penalize high order elements. We have also described and
discussed a domain decomposition strategy and a mapping algorithm which are
suitable for massively parallel processors such as the Connection Machine. The
main idea behind the decomposition technique is the minimization of both the
amount of wire contention within a chip and the amount of communication
between different chips. A given finite element mesh is partitioned into 16-element
subdomains which correspond to the 16-processor chips of the Connection
Machine. This partitioning is carried out in a way that minimizes the number of
nodes at the interface between the subdomains. As a result, only those processors
which are mapped onto finite elements at the periphery of a subdomain
communicate with processors packaged on different chips. Moreover, this partitioning
is such that the connectivity bandwidth of the resulting subdomains is large enough
to allow an efficient use of the interchip wires. The mapping algorithm attempts
to reduce the distance that information has to travel through the communication
network. In essence, it searches iteratively for an optimal mapping through a
two-step minimization of the communication costs associated with a candidate
mapping. Various issues related to the single instruction multiple data stream
nature of the CM-2 and pertinent to computational mechanics have been addressed.
Measured performance results for realistic two- and three-dimensional transient
problems have been reported. Three-dimensional and high order elements have
been shown to induce longer communication times. Mesh irregularities have been
shown to slow down the computation in several ways. The Data Vault has
been demonstrated to be very effective at reducing the I/O time.

Now, we briefly highlight some additional implementational and theoretical 
issues that we hope will materially advance the application ranges of finite element 
computations on this highly parallel processor. 

Virtual Processor Ratio vs. Substructuring 

In this work, we have assigned, when possible, more than one finite element to a
single processor using the virtual processor feature of the CM-2. However,
another way to obtain the same result is to assign a substructure to an individual
processor (Farhat, Wilson and Powell, [15] and Fox et al., [16]). From a numerical
point of view, both approaches are equivalent. However, these two distinct
approaches differ in their implementations and may perform differently. The 
substructure approach requires each processor to work with both external and 
internal data structures. The set of external data structures stores information 
about substructure interconnections. These are similar to the ones described in 
this paper. The set of internal data structures stores the connectivity table of 
the elements within a substructure. The computations within each substructure 
are carried out by looping over the elements of that substructure. The advantage 
of this approach is a saving in storage since the substructure internal nodes are 
uniquely defined, and a faster computation of the results associated with these 
nodes. Moreover, the global results at the internal nodes can be accumulated 
without any explicit call to a message-passing function. The global quantities 
at the boundary nodes are accumulated using the router and the external data
structures. However, the substructuring approach requires that the sequencer
broadcast the same instruction several times, once for each element of the sub- 
structure, which increases the overall wall clock execution time. Moreover, this 
approach does not allow the Weitek chip to pipeline the computations over the 
elements of the substructure. 

On the other hand, the virtual processor approach requires that each element 
communicate explicitly with its neighbors, even if these are assigned to the same 
processor. Of course, this communication is virtual since it is within the proces- 
sor itself and generates minimal additional overhead. On the positive side, the 
virtual processor approach utilizes only one type of data structure and exploits 
the pipelining capabilities of the Weitek chip. The latter feature significantly 
enhances overall performance, as demonstrated in Section III. Consequently, we 
advocate the use of the virtual processor ratio rather than the substructuring 
technique, especially if the processor memory size is to be increased in the future. 

Implicit Algorithms and the CM-2

In this report, no attempt has been made to design a novel parallel algorithm for
the solution of the differential equation of motion. We have selected the central
difference algorithm because of its inherent parallelism, which allowed us to focus
on implementational issues and to fully explore the multiprocessing capabilities
of the CM-2. Our experience suggests that a whole class of explicit and
semi-implicit dynamic and static algorithms can be implemented on the CM-2 in a
very similar way. Among others, we cite the EBE algorithms (Hughes et al.,
[23]), the EBE preconditioners (Hughes, Ferencz and Hallquist, [24]), and the
Jacobi preconditioned conjugate gradient algorithm (Golub and Van Loan, [25]).
However, the solution of some static and transient problems may necessitate 
the use of an implicit algorithm, which usually implies the solution of a set of 
simultaneous banded equations. If the global symmetric stiffness matrix K is
banded, with semi-bandwidth b, then it is well known (see for example Ortega
and Voigt, [26]) that Gaussian elimination methods for solving Kd = F allow
at each step on the order of $b^2/2$ pairs of $(+, \times)$ operations to be
processed concurrently, but
require significant communication because the b entries of the pivot column must
be made available to all other processors. Several parallel algorithms based on 
these elimination methods were designed for finite element applications and were 
implemented on earlier hypercubes (see for example, Farhat and Wilson [27] and
Utku, Salama and Melosh, [28]). Typically, a processor was assigned to a set 
of matrix columns. Results from our previous experience with the early version
of Intel’s iPSC suggest that direct solvers are feasible on hypercubes only when
the number of available processors, $N_p$, is much smaller than the bandwidth b
of the given finite element problem, so that communications do not dominate
computations. On the iPSC-1, a message that was sent from one extreme corner
of a 5-dimensional cube to the other would result in an elapsed time 475 times
longer than the time to perform a floating point multiplication (see Rudell, [29]).
However, on a 10-dimensional subcube of the CM-2 we have measured the ratio
of a broadcast to a floating point computation to be only about 2.87. This
observation suggests that for problems with b > 360, a processor could be mapped
onto a few matrix entries and a parallel direct solver could be feasible on the
CM-2. For problems with smaller bandwidth, direct solvers which operate on
more than one pivot at a time (Alaghband and Jordan, [30]; Peters, [31]) should
also be investigated for implementation on massively parallel processors.

There is an additional issue which has to be examined before attempting to
solve finite element equations on the CM-2 with a parallel direct solver. This issue
is related to the balance on massively parallel processors between the number
of available processors, $N_p$, and the processor memory size. Let $M_n$ denote a
two-dimensional regular n by n finite element mesh, where n is the number of
elements along one side. If d is the number of degrees of freedom at a given node,
the semi-bandwidth of $M_n$ is $b = d(n + 3)$ and the total number of mathematical
unknowns is $N = d(n + 1)^2$. For this mesh, the storage cost of K amounts to
$Nb = d^2 (n + 3)(n + 1)^2$ words. The total amount of storage available on the CM-2
is $S = N_p \times m_p$, where $N_p$ is the number of available processors and $m_p = 8$
Kbytes is the current size of the processor memory. Let $NE = n_{max}^2$ be the
maximum number of elements for which $M_n$ has a banded stiffness matrix that
can be factored in-core on the CM-2. Table 7 below gives the values of NE for
different values of d and for the case of a fully configured Connection Machine
($N_p = 65536$). Values of NE are shown for both single precision (32 bit words)
and double precision (64 bit words) floating-point arithmetic.

TABLE 7. Number of Allowable Elements vs. DOF/Node
for the Two-Dimensional Case

$N_p = 65536$           d=2      d=3     d=4     d=5     d=6

Single Precision  NE    102400   59536   40401   29929   23409
Double Precision  NE    64009    37249   25281   18769   14884
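
The storage bound behind Table 7 can be checked with a short C sketch. It
assumes, as a simplification not stated in the text, that the whole memory
$S = N_p m_p$ is available for the banded matrix K and that 1 Kbyte is 1024
bytes; the function names max_elements_2d and cm2_total_bytes are hypothetical.
Under these assumptions the single precision row of Table 7 is reproduced; the
double precision entries may differ slightly from the table.

```c
#include <assert.h>

/* Largest NE = n*n such that the banded stiffness matrix of an n-by-n mesh,
 * N*b = d^2 (n + 3)(n + 1)^2 words, fits in total_bytes of memory.
 * d          : degrees of freedom per node
 * word_bytes : 4 for single precision, 8 for double precision */
unsigned long long max_elements_2d(unsigned long long d,
                                   unsigned long long word_bytes,
                                   unsigned long long total_bytes)
{
    assert(d > 0 && word_bytes > 0);
    unsigned long long n = 0;
    /* Increase n while the matrix for the next mesh size still fits. */
    for (;;) {
        unsigned long long m = n + 1;
        unsigned long long words = d * d * (m + 3) * (m + 1) * (m + 1);
        if (words * word_bytes > total_bytes)
            break;
        n = m;
    }
    return n * n;
}

/* Memory of a fully configured machine as described in the text:
 * 65536 processors with 8 Kbytes each (1 Kbyte taken as 1024 bytes). */
unsigned long long cm2_total_bytes(void)
{
    return 65536ULL * 8ULL * 1024ULL;
}
```

For example, max_elements_2d(2, 4, cm2_total_bytes()) returns 102400, the
single precision entry for d = 2.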


Clearly, except for the case where d = 2 and floating-point arithmetic is done
in single precision, NE is smaller than $N_p$. Similarly, the case where $M_n$ is an n
by n by n three-dimensional regular mesh is assessed in table 8 below for various
values of d.


TABLE 8. Number of Allowable Elements vs. DOF/Node
for the Three-Dimensional Case

$N_p = 65536$           d=2      d=3     d=4     d=5     d=6

Single Precision  NE    29791    19683   13824   10648   8000
Double Precision  NE    19683    12167   9261    6859    4913

For this case, NE is much smaller than $N_p$, even for d = 2 and for single
precision floating-point arithmetic. For d = 6 (some shell elements), only 8000
elements (4000 elements) can be included in $M_n$ when computations are carried
out using single precision (double precision) floating-point arithmetic.

It is noted that the solution of a system of equations is only
one phase of several finite element computational sequences. In linear
three-dimensional analysis, this phase dominates the computer execution time.
However, in the nonlinear analysis of flexible space structures most of the
computational time is usually spent in modules that perform element level computations
[32]. These include the evaluation of generalized nodal internal forces and/or
elemental stiffness matrices. Consider now a mesh $M_n$ where the number of
elements NE is chosen so that the upper part of the banded stiffness matrix K
fills the $N_p$ processor memories completely. The preceding complexity analysis
demonstrates that the balance on the CM-2 between the number of processors
and the memory size of each processor is such that NE is much smaller than $N_p$.
Hence, if a direct algorithm is used to solve a finite element system of equations,
the $N_p$ processors will be active during the solution phase, but $N_p - NE$ processors
will remain idle during the rest of the phases which involve element level
computations. Consequently, an in-core direct solution strategy would not efficiently
utilize the computational power of the CM-2 in a highly nonlinear finite element
analysis.


ABSTRACT

The various forms of parallel numerical algorithms that speed 
up finite element computations are as numerous as the number 
of researchers working on the problem. In this paper, we re- 
view some of these parallel computational strategies and assess 
their adequacy for a given architecture and a given problem. We 
also report on the performance of both extreme parallel hardware 
technologies on real-life structural problems. 

I INTRODUCTION 

The realistic simulation of the nonlinear dynamics of complex 
structural systems remains beyond the feasible range of traditional
computers. It has been the author's experience that the
simulation of the transient response of a space station model with
100,000 degrees of freedom to various loading configurations
consumes over 10 CPU hours on a CRAY-2 supercomputer and that
the simulation of the deployment of a space structure is even more 
computationally demanding, especially if the control/structure in- 
teraction problem is to be included. The aeroelastic response of 
a detailed wing-body configuration using a potential flow theory 
requires about 5 CPU hours using the same supercomputer. In 
order to establish the transonic flutter boundary for a given set 
of aeroelastic parameters, about 30 aeroelastic response analyses 
are required, which brings the total CPU time to 6 days. If the 
full Navier-Stokes equations are to be solved, it is estimated that 
the CPU time increases by two orders of magnitude. It is also 
clear that large amounts of data can be generated in a large-scale 
transient structural analysis or a large-scale computational fluid 
dynamic solution. This raw data has to be interpreted, in real- 
time if possible, in order to be understood. 

Clearly, the true potential for execution improvement lies in 
massively parallel and/or parallel/vector supercomputing. The
commercial supercomputer manufacturers of the last decade have
extended their products into configurations that use a few vector
processors coupled around a massive shared memory (CRAY-2,
CRAY X-MP, CRAY Y-MP). Supercomputers with a larger number
of vector processors are also under development (CRAY-3).
Concurrent multiprocessors with much finer granularity and a
wide range of interconnection strategies are now appearing.
Recently, massively parallel computers such as the CONNECTION
MACHINE have demonstrated their potential to be the fastest
supercomputers, a trend that may accelerate in the future (McBryan
[1]). The advent of advanced frame buffers and high performance
workstations such as Ardent's TITAN now makes real-time visu- 
alization possible. 

Moving engineering applications to concurrent processors 
faces significant obstacles that will have to be resolved as such 
machines become more and more available. The obstacles center 
on algorithms, methods, languages, and education. In this paper, 
we address some of these issues in the context of finite element 
computations. 

The various forms of parallel numerical algorithms that speed 
up finite element computations are as numerous as the number of 
researchers working on the problem. Extensive lists of references 
on this topic may be found in the surveys of Noor [2], White 
and Abel [3], and Ortega, Voigt and Romine [4]. Throughout 
this paper, we discuss the adequacy of a set of parallel finite
element computational strategies (mesh preprocessing, solution
algorithms, I/O manipulations) for a given parallel processor and a
given structural and/or mechanical problem. This leads us to the
introduction of the notion of algorithmic portability in addition to 
the problem of language portability. 

The remainder of this paper is organized as follows. In Sec- 
tion II, we present an overview of the present status of paral- 
lel computers that is pertinent to finite element computations. 
Through the examples of SIMD (Single Instruction Multiple 
Data), MIMD (Multiple Instruction Multiple Data), local memory 
and shared memory multiprocessors, we address the impact of 
hardware architecture on the design and implementation of parallel
algorithms and parallel data structures. Section III focuses
on local memory MIMD hypercubes and Section IV on shared
memory multiprocessors. Section V summarizes the author's
experience with massively parallel finite element computations on
the CONNECTION MACHINE. Performance results and concluding
remarks are offered in Section VI and Section VII.

Because of space limitations, algorithmic details and formulas 
are avoided. The paper emphasizes major results and conclusions. 
For specific details, the reader is urged to consult the references. 

II WHAT ONE MUST KNOW ABOUT PARALLEL 
PROCESSORS 

Several parallel computers have already been marketed commer- 
cially. Rather than discuss these individually, we here focus on 
presenting an overview of their architecture and emphasize the im- 
pact of their hardware features on the design and implementation 
of parallel computational strategies for finite element simulations. 

A review of some of the commercially available parallel systems 
can be found in Babb [5], where programming examples are also 
provided. 

Multiprocessors can be generally described by three essential 
elements: granularity, topology and control. 

Granularity relates to the number of processors and involves 
the size of these processors. A fine-grain multiprocessor features 
a large number of usually very small and simple processors. The 
CONNECTION MACHINE (65,536 processors) is such a massively
parallel supercomputer. NCUBE’s 1024-node and iPSC's
128-node models are comparatively medium-grain machines. On
the other hand, a coarse-grain multiprocessor is typically built by
interconnecting a small number of large, powerful processors,
usually but not necessarily vector processors. ALLIANT FX/8 (8
processors), IBM 3090-VF (6 processors), CRAY X-MP (4
processors), CRAY-2 (4 processors) and the ETA-10 (8 processors)
are examples of such multiprocessors and supermultiprocessors. 
Granularity directly affects the parallel computational strategy. 
On a coarse-grain multiprocessor, finite element computations can 
be parallelized at the subdomain level. On a fine-grain machine, 
they are best parallelized at the element and sometimes at the 
degree of freedom level. When designing parallel algorithms for 
finite element computations on coarse grained vector super-multi- 
processors, one should preserve vectorization. This is because the 
potential speed-up due to interconnecting a few vector processors 
cannot compete with the speed-up due to the vector capabilities 
of a single processor. This matter is addressed and emphasized 
in Section IV. 

Topology refers to the pattern in which the processors are 
connected and reflects how data will flow. Currently available 
designs include hypercube arrangement, network of busses, and 
banyan networks. Usually, the interconnection topology is re- 
lated to the memory organization. For example, iPSC, NCUBE 
and the CONNECTION MACHINE are local memory multipro- 
cessors with a hypercube topology. On these systems, a processor 
is assigned its own (local) memory and can only access this mem- 
ory. Independent processors communicate by sending each other 
messages. Efficient solution of finite element simulations on these 
machines requires minimizing the interprocessor communication 
bandwidth, especially when the communication hardware/software
is relatively slow. This requires the mapping of adjacent elements
as much as possible onto directly connected processors, which may 
be no trivial problem. On the other hand, the processors on a 
shared memory system such as ALLIANT FX/8 are connected 
through a common memory bus and can access the same (global) 
large memory system. Adequate finite element parallel data struc- 
tures are crucial for efficient computations on both shared and 
local memory multiprocessors. On a local memory machine, one 
has to introduce the concept of distributed data base and data 
structure. Each local memory is loaded only with the data relevant
to the computational task assigned to its attached processor.
For a system with thousands of processors, the total amount of 
available memory can be very large. Yet, it is the storage capac- 
ity of each local memory which really matters. Different finite 
elements require different amounts of data to be stored. For each 
finite element in the mesh, a material and geometrical nonlinear 
high order shell element may require an amount of data storage 
two orders of magnitude higher than a simple linear truss element. 
Hence, one may be able to assign one or several finite elements 
of a certain type to one processor but may fail in the attempt to 
assign one or several elements of another type to a similar pro- 
cessor. Also, in the case of MIMD machines such as iPSC and 
NCUBE, one has to ensure that the compiled subroutines can be 
accommodated on the local memory. Consider the case where a 
processor is mapped onto a submesh containing different types of 
elements. In this situation, one has to load into the processor's
local memory all the element libraries for the types encountered 
in the assigned submesh. Generally, one can overcome these prob- 
lems by devising an intelligent partitioning scheme and a compact 
data structure. Careful data structures must also be designed for 
shared memory multiprocessors to avoid potential serializations 
due to memory conflicts. 

Control describes the way the work is divided up and syn- 
chronized. Of particular interest are the SIMD and MIMD ma- 
chines. The CRAY-2 (4 processors) and iPSC (128 processors) are 
respectively a shared memory MIMD supermultiprocessor and a 
local memory MIMD hypercube. They can simultaneously ex- 
ecute multiple instructions which can operate on multiple data. 
The CONNECTION MACHINE is an SIMD system where a single
instruction is executed at a time, an instruction which can
operate on multiple data. Typically, on an SIMD machine a single
program executes on the front end and its parallel instructions
are submitted to the processors. On an MIMD parallel processor,
separate program copies execute on separate processors.

Practically, local memory parallel processors are more diffi- 
cult to program than shared memory multiprocessors. However, 
this does not imply that optimal performance is easily achieved 
on shared memory machines, especially when vector processors 
are interconnected. It is believed that local memory systems are 
easier to scale to a large number of processors. Shared mem- 
ory multiprocessors are usually coarse grained because the bus to 
memory saturates and/or becomes prohibitively expensive above 
a few processors. However, machines such as Evans and
Sutherland's ES-1 and MYRIAS are considered as shared memory
multiprocessors and can be configured with several hundreds of
processors. Note also that on SIMD machines, one has to devise
special tricks to be able to process parallel finite elements of dif- 
ferent types, since these do not involve the same instructions and 
only one instruction can be executed at a time. 


III FE COMPUTATIONS ON MIMD LOCAL MEMORY
MULTIPROCESSORS

Several solution algorithms have been designed for static, modal 
and transient finite element analyses on MIMD local memory 
multiprocessors. Examples of these can be found in Farhat and 
Wilson [6] (Intel's iPSC), Lyzenga, Raefsky and Hager [7] and 
Nour-Omid, Raefsky and Lyzenga [8] (JPL/Caltech’s MARK III).
Typically, these algorithms stem from the divide and conquer 
paradigm. 

Consider the finite element discretization of the mechanical 
joint shown in figure 1. If the complete finite element system is 
subdivided into $N_s$ subdomains, each group of elements within a
subdomain can be processed in parallel. The data structure for
such an approach is very simple. On local memory multiprocessors,
only the node geometry and element properties
within the substructure need be stored within the RAM
(Random Access Memory) of the processor assigned to that
subdomain. In addition, concurrent formation and reduction of the
mass, damping and stiffness matrices for that region require no
interprocessor communication. Message passing occurs only when
transferring solutions between subdomain interconnected boundaries.
The latter phase often determines the efficiency of the parallel
computational approach. While load balancing is an important
criterion for automatically subdividing a mesh into as many
submeshes as there are available processors, $N_p = N_s$, it is not
sufficient by itself to determine the partitioning algorithm.



Fig. 1 Discretization of a mechanical joint

Suppose that a parallel explicit or explicit-like algorithm is 
to be implemented on an MIMD local memory multiprocessor. 
It could be, for example, an iterative solver for the linearized 
static problem, or a time integration explicit algorithm for the 
transient response analysis. Typically, these computations involve 
matrix-vector products $Ku$ and inner products $u^T u$ which can be
evaluated in parallel as:

$$Ku = \sum_{j=1}^{N_s} K_j u_j, \qquad u^T u = \sum_{j=1}^{N_s} u_j^T u_j \qquad (1)$$

where $K_j$ denotes the stiffness of the j-th subdomain and $u_j$
is the localization of the displacement vector to the j-th subdomain.
In this case, only neighboring subdomains need to exchange
boundary information. Hence, an optimal decomposition
is the one which minimizes the communication bandwidth of the
problem, that is, the subdomain connectivity. This strategy is
discussed by Malone in [9]. When applied to the above problem,
for $N_p = 32$, it delivers the partitioning shown in figures 2a-2b.
The average and maximum communication bandwidths are 5 and
8 respectively, and the number of interface nodes is 718.
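
The subdomain-wise matrix-vector product described above can be sketched in
C. This is an illustrative serial sketch, not the implementation of the cited
references: the Subdomain struct, the dense row-major storage of $K_j$ and the
function name subdomain_matvec are assumptions. Each subdomain computes
$K_j u_j$ from its own data; the results are scattered into the global vector
through the subdomain's list of global degree-of-freedom indices, which is
where contributions at shared boundary unknowns add up and where neighboring
processors would exchange data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subdomain data: a dense local stiffness K_j (row-major,
 * n_local x n_local) and the global index of each local unknown. */
typedef struct {
    size_t n_local;
    const double *K;        /* local stiffness K_j                   */
    const size_t *global;   /* local-to-global degree-of-freedom map */
} Subdomain;

/* y = K u assembled as the sum over subdomains of K_j u_j.
 * y must have n_global entries; it is overwritten. */
void subdomain_matvec(const Subdomain *subs, size_t n_subs,
                      const double *u, double *y, size_t n_global)
{
    assert(subs != NULL && u != NULL && y != NULL);
    for (size_t g = 0; g < n_global; g++)
        y[g] = 0.0;

    /* Each pass of this loop is independent work for one processor. */
    for (size_t j = 0; j < n_subs; j++) {
        const Subdomain *s = &subs[j];
        for (size_t r = 0; r < s->n_local; r++) {
            double sum = 0.0;
            for (size_t c = 0; c < s->n_local; c++)
                sum += s->K[r * s->n_local + c] * u[s->global[c]];
            /* Shared boundary unknowns receive several contributions. */
            y[s->global[r]] += sum;
        }
    }
}
```

With two spring-like subdomains sharing one unknown, the shared entry of y
receives one contribution from each subdomain, which is the only point at
which the two processors would need to communicate.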



Fig. 2a Decomposition with $N_p = 32$

Fig. 2b Interprocessor communication pattern

Suppose now that parallel implicit static or dynamic computations
are to be invoked. In this case, a higher level of parallelism
is obtained by treating the interface nodes as a separate
entity, and numbering the unknowns so that, for example, the
stiffness matrix has the pattern shown in figure 3b. The submatrices
$K_{jj}$, $K_{II}$ and $K_{jI}$ denote respectively the subdomain and
interface stiffnesses, and the coupling term. Clearly, all subdomains
can be processed in parallel after the interface problem

$$\Big(K_{II} - \sum_{j=1}^{N_s} K_{jI}^T K_{jj}^{-1} K_{jI}\Big) u_I = f_I - \sum_{j=1}^{N_s} K_{jI}^T K_{jj}^{-1} f_j \qquad (2)$$

has been solved. In equation (2), $u_I$ and $f_I$ are respectively
the generalized displacements and forces at the interface nodes.
The size of the interface problem determines the efficiency of this
parallel stratagem. An optimal decomposition for this approach
which minimizes the number of interface nodes is presented by
Farhat in [10]. When applied to the above mechanical joint (fig.
3a), for $N_p = 32$, it delivers 356 interface nodes only.
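
The two-stage structure of this approach can be sketched in C on a toy
problem. The sketch assumes the interface problem has the Schur complement
form $(K_{II} - \sum_j K_{jI}^T K_{jj}^{-1} K_{jI}) u_I = f_I - \sum_j K_{jI}^T K_{jj}^{-1} f_j$
and, purely for illustration, that every block is a scalar: each subdomain has
one interior unknown and all subdomains share a single interface unknown. The
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy version of the interface problem: every subdomain j has a single
 * interior unknown with stiffness kjj[j], coupled through kjI[j] to one
 * shared interface unknown with stiffness kII. */
double solve_interface(double kII, double fI,
                       const double *kjj, const double *kjI,
                       const double *fj, size_t n_subs)
{
    double schur = kII;   /* K_II - sum_j K_jI^T K_jj^{-1} K_jI */
    double rhs = fI;      /* f_I  - sum_j K_jI^T K_jj^{-1} f_j  */
    for (size_t j = 0; j < n_subs; j++) {
        assert(kjj[j] != 0.0);
        schur -= kjI[j] * kjI[j] / kjj[j];
        rhs   -= kjI[j] * fj[j] / kjj[j];
    }
    assert(schur != 0.0);
    return rhs / schur;
}

/* Once u_I is known, each subdomain recovers its interior unknown
 * independently: K_jj u_j = f_j - K_jI u_I. */
void recover_interior(double uI, const double *kjj, const double *kjI,
                      const double *fj, double *uj, size_t n_subs)
{
    for (size_t j = 0; j < n_subs; j++)
        uj[j] = (fj[j] - kjI[j] * uI) / kjj[j];
}
```

The loop in solve_interface accumulates one independent contribution per
subdomain, and the loop in recover_interior has no coupling between
subdomains at all; only the small interface solve in between is shared.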



Fig. 3a Decomposition: $N_p = 32$ and interface minimization

Fig. 3b Pattern for stiffness matrix

Equation (2) can be solved using a direct method (Farhat,
Wilson and Powell [11]), or an iterative one (Farhat and Wilson
[12], Nour-Omid, Raefsky and Lyzenga [8]). Let $n_I$ denote
the average number of interface nodes per subdomain, and d the
number of degrees of freedom per node. If the interface problem
is treated with a direct solver, the formation of Schur’s complement
$K_{II} - \sum_{j=1}^{N_s} K_{jI}^T K_{jj}^{-1} K_{jI}$ requires $2 N_s n_I d$ solutions of
sparse triangular systems. On the other hand, each conjugate
gradient iteration involves $N_s$ matrix-vector products of the form
$K_{jI}^T K_{jj}^{-1} K_{jI} u_I^{(k)}$, which require $2 N_s$ solutions of sparse
triangular systems. If memory is an issue, the fill-in of (2) can be such
that an iterative solver, for example the preconditioned conjugate
gradient method, is recommended for the solution of the
interface problem. If the coupling between subdomains is very
strong, a preconditioned conjugate gradient algorithm may require
more than $n_I d$ iterations to achieve convergence, so that a
direct method becomes more advantageous.

Parallel modal and transient analyses using both approaches 
have been tested on Intel’s iPSC (Farhat and Wilson [13],
Malone [9]). 

The reader should note that when using implicit computations,
the substructuring technique introduces a high level of
parallelism, although sometimes at the cost of additional floating
point computations. On the other hand, parallel direct solvers do
not increase the computational complexity, but on local memory
multiprocessors they may suffer from interprocessor communication
costs. For this reason, and because the degree of parallelism
that direct parallel solvers offer is limited by the mesh
bandwidth, the author does not recommend their use for the
solution of the entire finite element system on currently
available local memory multiprocessors, especially if the number
of processors is large, say $N_p > 128$. However, they have been
successfully combined with the substructuring technique to solve
the interface problem only (see Farhat, Wilson and Powell [11]).

IV FE COMPUTATIONS ON MIMD SHARED MEMORY MULTIPROCESSORS

In principle, parallel algorithms which are developed for local
memory multiprocessors can be used on shared memory machines.
However, a much higher performance can be achieved if the special
features of these machines are fully exploited. In particular, if
the multiprocessor offers a vector capability, the algorithms
outlined in Section III must be revisited.

For explicit computations on a shared memory multiprocessor,
the substructuring approach advocated in Section III may
also be used. Interface data may be either duplicated in the
shared memory, or treated as a Critical Section (see Benten,
Farhat and Jordan [15]), that is, a portion of a code where
a processor needs to store into a memory location used
concurrently by another processor. In the latter case, the
processors are serialized when processing the interface degrees
of freedom.
For example, while in (1) the quantities u } and u^Uj can be 

evaluated in parallel for all j, the assembly of the results at the
interface nodes is recursive and requires serialization. A more
efficient approach on shared memory machines is described by
Farhat and Crivelli in [16]. Explicit computations are parallelized
at the element level. Memory contention, and therefore Critical
Sections, are avoided by processing the elements in an order
dictated by a graph coloring algorithm [16]. Basically, the mesh is
partitioned into sets of internally disjoint elements, so that
vectorization and parallelization are optimized. For example, when
applied to the problem shown in figure 1, the coloring scheme
creates 8 sets of internally disjoint elements. Figure 4 shows the
elements in set 3 for this example. Within a set of elements,
explicit computations are performed asynchronously. Synchronization
points are required only between the processing of two different
sets of elements (8 synchronization points in this case).
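The idea of partitioning a mesh into sets of internally disjoint elements can be illustrated with a simple greedy coloring. This is a minimal sketch under stated assumptions: it is not the algorithm of [16], the function names are hypothetical, and two elements are taken to conflict whenever they share at least one node.

```python
from collections import defaultdict


def color_elements(elements):
    """Greedy coloring: elements that share a node never receive the same color.

    `elements` is a list of tuples of node ids; the result is one color
    index per element, in the same order.
    """
    colors_at_node = defaultdict(set)  # node id -> colors already used there
    colors = []
    for nodes in elements:
        used = set()
        for node in nodes:
            used |= colors_at_node[node]
        color = 0
        while color in used:  # smallest color not used by any neighbour
            color += 1
        colors.append(color)
        for node in nodes:
            colors_at_node[node].add(color)
    return colors


def group_by_color(colors):
    """Return the element indices of each color set, ordered by color."""
    groups = defaultdict(list)
    for element, color in enumerate(colors):
        groups[color].append(element)
    return [groups[color] for color in sorted(groups)]


if __name__ == "__main__":
    # A chain of four 2-node bar elements: 0-1, 1-2, 2-3, 3-4.
    chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
    print(group_by_color(color_elements(chain)))
```

All elements within one set can be processed concurrently without a Critical Section, since no two of them write to the same node; a synchronization point is needed only when moving from one set to the next.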


computational savings due to a lower subdomain bandwidth offset
the computational requirements of the interface problem (2), so
that the parallel algorithm based on substructuring is faster, even
in serial mode, than a global Choleski decomposition.



Fig. 4 Internally disjoint elements in set #3


Fig. 6a Decomposition with $N_p = 4$ and interface minimization


Contrary to popular belief, implicit computations are more
difficult to optimize on shared memory multiprocessors. To
illustrate this fact, we consider the static solutions of the
mechanical joint and of the Solid Rocket Booster (SRB) (fig. 5)
problems, for a prescribed loading. Moreover, we assume that
$N_p = 4$ processors are available. The subdivisions of both meshes
into balanced subdomains with a minimum number of interface nodes
are depicted in figures 6a-6b.




Fig. 5 Solid rocket booster 

The discretized mechanical joint contains 456 elements and 
852 nodes. After node-renumbering, the average profile band- 
width is 168. The optimized average profile bandwidth for each 
subdomain is 93. Therefore, the parallel reduction of each
subdomain benefits not only from having fewer equations to
reduce, but also from a smaller bandwidth. For this problem, the


Fig. 6b Decomposition with $N_p = 4$ and interface minimization

The discretized SRB model has 10,453 elements, 9,206 nodes
and 54,870 degrees of freedom. The number of interface nodes
corresponding to its subdivision into 4 subdomains is 165. After
node renumbering, the average profile bandwidth is 310. The
optimized average profile bandwidth for the 4 subdomains is 365.
Clearly, for this problem, reducing the 4 subdomain stiffnesses
requires slightly more floating point operations than reducing the
global stiffness. Consequently, all the manipulations involved in
the solution of the interface problem (2) are additional
computations generated by the substructuring method. Fortunately,
the interface problem size is only about 2% of the size of the
entire problem, so that the parallel method is still feasible.
However, for this problem, and especially if vector processing is
available, a better performance is achieved with a parallel highly
vectorized direct solver. Indeed, the effectiveness of the solution
of the interface problem (2) comes from sophisticated
implementations whose details cannot be described here. For
example, the algebraic manipulations involved in the evaluation of
the quantity

$$K_{jI}^T K_{jj}^{-1} K_{jI} \qquad (3)$$

do not vectorize well if the sparse data structures and
computational techniques described by George and Liu in [17], and
advocated by the author for local memory multiprocessors [11], are
used, unless special tricks are invoked. On the other hand, for
the SRB problem, a parallel direct global solver such as the
parallel active column solver presented by Farhat and Wilson in
[18], or a parallel version of the highly vectorized variable band
solver described by Poole and Overman in [19], is very efficient
on a parallel/vector supercomputer (CRAY Y-MP, 8 processors). For
the SRB problem, this is especially true because the bandwidth
to number of processors ratio is 310/8 = 38.75.

Clearly, the above examples demonstrate that the optimal 
efficiency of a parallel algorithm depends on the underlying hard- 
ware architecture and on the topological characteristics of the 
problem to be solved. 

With the advent of hardware gather-scatter on most recent
vector supercomputers, significant progress has been made in
implementing sparse linear equation solvers on these machines (see
Lewis and Simon [20]). Recently, Ashcraft, Grimes, Lewis, Peyton
and Simon [21] have used the new algorithmic concept of
a supernodal sparse factorization for implementing a very fast
sparse linear solver on the CRAY X-MP. The key ideas behind
the high level of vectorization come from the graph theory model
of the sparse elimination process which can be found in the book
of George and Liu [17]. In [22], Simon, Vu, and Yang describe
a parallel implementation of the supernodal sparse code which
delivers a performance rate as high as 1.682 GIGAFLOPS. However,
sparse solvers require a preliminary nodal re-ordering (e.g.
a minimum degree ordering) and a symbolic factorization, which can
consume a significant amount of CPU time. Therefore, they
are most effective in nonlinear problems or problems with several
right-hand sides, where the preprocessing phase is done once.

Parallel I/O developments for finite element simulations, and 
performance measurements on shared memory multiprocessors 
can be found in Farhat, Pramono and Felippa [14]. 

V FE COMPUTATIONS ON A MASSIVELY PARALLEL PROCESSOR

The CONNECTION MACHINE is probably the only massively
parallel processor that is now commercially available. It consists
of two parts: a front end computer (VAX, SYMBOLICS, SUN),
and a 64K processor hypercube (65,536 single bit processors). The
front end computer provides instruction sequencing and program
development, and has the ability to address any location in the
hypercube distributed memory. The hypercube system provides
number crunching power.

Recently, Farhat, Sobh and Park [23, 24] have investigated
massively parallel transient finite element explicit computations
on the CONNECTION MACHINE. Preliminary results can be
found in [23] and more detailed information in [24]. In general,
it has been found that this highly parallel processor can
outperform vector supercomputers on explicit computations, but not
on implicit ones. Several features distinguish the CONNECTION
MACHINE from earlier hypercubes. On the hardware side, we
note the impressive number crunching power and the fast parallel
I/O capabilities. On the software side, we note the virtual
processor concept, which is in a sense the dual of the well-known
virtual memory concept.

Mesh decomposition and processor-to-element mapping are the two
fundamental keys for efficient massively parallel finite element
computations. A given finite element mesh is partitioned into
16-element subdomains which correspond to the 16-processor chips
of the CONNECTION MACHINE. This partitioning is carried out in a
way that minimizes the number of nodes at the interface between
the subdomains. As a result, only those processors which are
mapped onto finite elements at the periphery of a subdomain
communicate with processors packaged on different chips. Moreover,
this partitioning is such that the connectivity bandwidth of the
resulting subdomains is large enough to allow an efficient use of
the 12 interchip wires. The mapping algorithm attempts to reduce
the distance that information has to travel through the
communication network. In essence, it searches iteratively for an
optimal mapping through a two-step minimization of the
communication costs associated with a candidate mapping (see
Farhat [25]).

We summarize herein the basic conclusions reported in [23, 24].
The processor memory size of 64 Kbits penalizes high order
elements. Three-dimensional and high order elements induce longer
communication times. Mesh irregularities slow down the computation
speed in many ways. The Data Vault is very effective at reducing
I/O time. The Frame Buffer is ideal for real-time visualization.
Finally, the virtual processor concept outperforms the
substructuring technique on the CONNECTION MACHINE.


VI PERFORMANCE EXAMPLES 

The speed-up and MFLOP rates reported in this section include 
all phases of the finite element analyses. A pair of (+,*) is counted 
as 2 flops. 

To illustrate the surgeon approach to parallel/vector finite
element computations, we report on the solution of three different
problems on three different multiprocessors. First, we consider a
modal analysis of the simplified space station model shown in
figure 7. The finite element mesh comprises 384 nodes, 1264 beam
elements and 2304 degrees of freedom. Since this is a rather small
problem, we consider the use of an Intel iPSC with 16 processors
and 4 Mbytes of available memory. After node renumbering, the
average profile bandwidth for this problem is DO. We select not to 
use a global parallel direct solver to carry out implicit
computations, because it would allow only 6 columns of the
stiffness matrix to be assigned to one processor, which would make
interprocessor communications dominate local computations.
Therefore, we select an approach based on the substructuring
technique outlined in Section III. The mesh is decomposed into 16
balanced subdomains, each containing approximately 79 elements.
The size of the interface problem is 672 (112 nodes). Our parallel
algorithm for eigenvalue extraction and modal superposition on a
hypercube architecture is described in [13]. The number of
extracted modes is 200. The performance results for this analysis
on the iPSC are reported in table 1.



Fig. 7 Space station structural model 


Table 1 Modal analysis on a 16-processor iPSC

Space station structural model
2,304 d.o.f. - 200 modes

Phase                          Speed-up   MFLOPS
Forming K and M                15         0.6
Factoring K                    12         0.5
Generating Lanczos vectors     12         0.5
Extracting 200 frequencies     14         0.4
Computing 200 mode shapes      13         0.5


Next, we consider the transient response of a more detailed
space station model to perturbations induced by shuttle docking.
The finite element model for this analysis incorporates 7596
2-node beam elements, 572 4-node shell elements, 24 3-node rigid
elements, 9802 nodes and 58,812 degrees of freedom (fig. 8). Given
the size of this problem, we choose to run it on an 8K CONNECTION
MACHINE using the parallel central difference algorithm
[20]. Table 2 summarizes the measured performances for
computations and I/O manipulations. The latter correspond to
dumping, at each time step, the computed displacements, velocities,
accelerations, stresses and strains onto the front end. The
reported performances are scaled to the full 64K processor
configuration (see [1] for justifications). For this problem, the
Data Vault improves I/O by a factor of 1307 (18,300 secs through
the front end versus 14 secs through the Data Vault).



Fig. 8 Detailed space station model 


Table 2 Transient analysis on the Connection Machine

Detailed space station structural model
58,812 d.o.f. - 2000 time integration steps

Phase                          CPU time       MFLOPS
                               (using C m )
Mesh decomposition             3 secs         -
Data loading in the CM2        41 secs        -
Equation of Motion Solving     4500 secs      340
Computation time               2500 secs      665
Communication time             2000 secs      -
I/O through front end          18,300 secs    -
I/O through data vault         14 secs        -


Finally, we consider the static analysis of the SRB on a CRAY
Y-MP with 8 processors. Following the reasoning of Section IV,
we choose to perform the factorization of the stiffness matrix
using a global parallel direct algorithm. For this purpose, we have
developed a parallel/vector version of the highly vectorized direct
solver described in [19]. The measured performances for 1, 2 and
4 processors are tabulated below (table 3). No results are
available for the case $N_p = 8$ because the author could not
arrange for dedicated time on the CRAY Y-MP.

The SRB problem was also solved in [22] using the supernodal
sparse factorization. The corresponding results are displayed in
table 4. It is interesting to note that while the sparse
factorization is twice as fast as the variable band solver on a
single CPU, both algorithms become comparable on 4 CPUs. Note also
that for the SRB problem, it appears that the supernodal code does
not parallelize well.
Table 3 Static analysis on the CRAY Y-MP
SRB structural model - 54,870 d.o.f.

Number of processors   CPU time      Speed-up   MFLOPS
1                      39 secs       1          235
2                      19.79 secs    1.97       464
4                      10 secs       3.90       918
8                      NA            NA         NA
Table 4 Static analysis on the CRAY Y-MP

SRB structural model - 54,870 d.o.f.
Supernodal Sparse Factorization

Number of processors   CPU time      Speed-up   MFLOPS
1                      20.21 secs    1          231.71
2                      13.12 secs    1.54       355.79
4                      9.53 secs     2.12       491.45
6                      8.53 secs     2.37       548.90
8                      8.12 secs     2.49       578.08
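The speed-up columns of tables 3 and 4 follow directly from the reported CPU times. The short sketch below recomputes them and adds the parallel efficiency (speed-up divided by the number of processors), a quantity that is not tabulated in the source; the function name is hypothetical.

```python
def speedup_and_efficiency(cpu_time_by_procs):
    """Map each processor count to (speed-up, efficiency) relative to 1 processor."""
    t1 = cpu_time_by_procs[1]
    return {
        procs: (t1 / t, t1 / (t * procs))
        for procs, t in sorted(cpu_time_by_procs.items())
    }


# CPU times in seconds, copied from tables 3 and 4.
VARIABLE_BAND = {1: 39.0, 2: 19.79, 4: 10.0}
SUPERNODAL = {1: 20.21, 2: 13.12, 4: 9.53, 6: 8.53, 8: 8.12}

if __name__ == "__main__":
    for name, times in (("variable band", VARIABLE_BAND), ("supernodal", SUPERNODAL)):
        for procs, (s, e) in speedup_and_efficiency(times).items():
            print(f"{name:14s} N_p={procs}: speed-up {s:.2f}, efficiency {e:.2f}")
```

On 4 processors the efficiency of the variable band solver stays close to 1, whereas that of the supernodal code drops to roughly one half, which is consistent with the remark that the supernodal code does not parallelize well on this problem.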


CONCLUSIONS 

In summary, the choice of a parallel finite element algorithm 
should be dictated by the multiprocessor to be used and the prob- 
lem to be solved. On local memory MIMD (Multiple Instruction 
Multiple Data) parallel processors, the substructuring technique 
is recommended for both implicit and explicit computations. On 
shared memory multiprocessors, the decision is more difficult. If 
the bandwidth of the problem is small, say only 5 times the num- 
ber of available processors, the substructuring technique is still 
recommended, unless the bandwidth of each subdomain is not 
lower than that of the global problem. Otherwise, a global parallel 
solver is advocated. In the case where vector processing is avail- 
able, special data structures and computational orderings must be 
used in order to fully exploit the vectorization capabilities. The 
analyst must realize that the potential speed-up due to intercon- 
necting a few vector processors cannot compete with the speed-up 
due to the vector capabilities of a single processor. Finally, mas- 
sively parallel processors are just emerging. The CONNECTION 
MACHINE can outperform vector supercomputers when explicit 
computations are utilized. 

While most portability problems on serial machines are due
to subtleties in compilers and high-level languages, parallel
computers will face the additional burden of algorithmic
portability. Currently, the only portable parallel code is one
driven by an analyzer which takes as input the problem to be
solved and the multiprocessor to be used, and outputs the switch
for the right parallel algorithm to be invoked.
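Such an analyzer can be pictured as a small decision function. The sketch below is only an illustration that encodes the recommendations stated in these conclusions and in Section IV; the class names, the string labels and the reading of "small bandwidth" as at most 5 times the number of processors are assumptions, not the author's software.

```python
from dataclasses import dataclass

SMALL_BANDWIDTH_FACTOR = 5  # "say only 5 times the number of available processors"


@dataclass
class Problem:
    explicit: bool            # True for explicit, False for implicit computations
    global_bandwidth: int     # average profile bandwidth of the global problem
    subdomain_bandwidth: int  # average profile bandwidth of the subdomains


@dataclass
class Machine:
    memory: str               # "local", "shared" or "massively_parallel"
    n_procs: int


def select_algorithm(problem, machine):
    """Return a label for the parallel algorithm suggested by the conclusions."""
    if machine.memory == "local":
        return "substructuring"
    if machine.memory == "massively_parallel":
        # The text only reports an advantage for explicit computations.
        return "explicit element-level" if problem.explicit else "no recommendation"
    if machine.memory == "shared":
        if problem.explicit:
            return "element-level with graph coloring"
        small = problem.global_bandwidth <= SMALL_BANDWIDTH_FACTOR * machine.n_procs
        if small and problem.subdomain_bandwidth < problem.global_bandwidth:
            return "substructuring"
        return "global parallel direct solver"
    raise ValueError(f"unknown memory organization: {machine.memory!r}")


if __name__ == "__main__":
    srb = Problem(explicit=False, global_bandwidth=310, subdomain_bandwidth=365)
    print(select_algorithm(srb, Machine(memory="shared", n_procs=8)))
```

For the SRB data quoted in Section IV (bandwidth 310, subdomain bandwidth 365, 8 processors), this sketch selects the global parallel direct solver, in line with the choice made in Section VI.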
SUMMARY

I/O issues in finite element analysis on parallel processors are addressed. Viable solutions for both local and
shared memory multiprocessors are presented. The approach is simple but limited by currently available
hardware and software systems. Implementation is carried out on a CRAY-2 system. Performance results are
reported.

1. INTRODUCTION 

Several parallel processor projects have already resulted in commercial multiprocessors (iPSC,
AMETEK, NCUBE, Connection Machine, Encore Multimax, Sequent, ALLIANT FX/8, CRAY
X-MP, CRAY-2, etc.). These machines cover a broad spectrum in terms of three factors: (a)
granularity, ranging from 2 to 65,536 processors, (b) peak performance, from 0.9 to 20,000 Mflops
and (c) cost, from $0.125 M to $10 M. Other projects are still under development worldwide
(GF-11, NYU/IBM, SUPRENUM, Myrias, etc.; see Reference 1 for details). Some numerical
algorithms have been revised, and some completely redesigned, for implementation on these
multiprocessors.²

Solid mechanics and structural analysis are important major application areas for parallel
computing. This is reflected by the continuously increasing number of publications on this topic
over the last few years. An extensive list of references on finite element computations and
supercomputing may be found in Reference 3. In these references various aspects of the subject,
such as parallel element-by-element procedures and linear solvers, have been investigated, and
implementation schemes have been proposed and assessed. However, no attempt has been made
to address, investigate and/or experiment on parallel I/O.

It is very well known that I/O manipulations can easily dominate the execution time of a finite
element code. Hence, speeding up these manipulations through parallel processing should be of
primary concern. This paper attempts to achieve this goal. Section 2 summarizes the occurrence of
I/O in finite element computations. Section 3 reviews the basic features of parallel processors and
emphasizes their I/O capabilities and limitations. In Section 4, two simple approaches for
handling parallel I/O on multiprocessors are proposed. Section 5 specializes our views to the
CRAY-2 supermulticomputer and reports on our 'hands on' experience with it. Remarks and
conclusions are offered in Section 6.


2. I/O IN FINITE ELEMENT COMPUTATIONS 


Realistic finite element modelling of real engineering systems involves the handling of very large
data spaces which can amount to several gigabytes of memory. To cope with this, many programs
in the general area of solid mechanics and structural analysis use out-of-core data base
management systems. However, I/O traffic between the disk and the processor memory slows
down the computations significantly and increases even more significantly the overall cost of the
analysis.

In a typical finite element analysis, nodal and element data are retrieved from a storage disk
before their processing, then stored back on the same storage disk after their processing has been
completed. Examples include the transfer of nodal point co-ordinates, elemental mass and stiffness
matrices in element-by-element computational procedures, and of history response arrays in
time-stepping algorithms for linear and non-linear dynamics. Other examples include the movement,
into core and out of core, of blocks of an assembled stiffness or mass matrix in original or factored
form, and the output on disk of the final results of an analysis. Table I is borrowed from Reference
4. It summarizes the comparative elapsed times for CPU and I/O on a VAX 11/780 of an analysis of
a cylindrical tube with a viscoplastic behaviour. The frontal method, which is known to be I/O
bound, was used for the solution phase. Data transfers were carried out through Fortran I/O.

Clearly, the performance results reported in Table I underline the potential of I/O for 
bottlenecks in finite element computations. Speeding up all the computational phases through 
parallel processing is certainly an important issue. However, reducing the amount of time spent in 
data transfers can become even more of an issue. 


3. ARCHITECTURE AND HARDWARE 

Recently, several parallel computers have arrived on the scene with a variety of different 
architectures. These generally can be described through three essential elements, namely, 
granularity, topology and control: 

• Granularity relates to the number of processors and involves the size of these processors. A
fine-grain multiprocessor features a large number of usually very small and simple
processors. The Connection Machine (65,536 processors) is such a massively parallel
supercomputer. NCUBE's 1024-node model is a comparatively medium-grain machine. On the
other hand, a coarse-grain supermultiprocessor is typically built by interconnecting a small
number of large, powerful processors, usually vector processors. CRAY X-MP (4
processors), CRAY-2 (4 processors) and ETA-10 (8 processors) are examples of such
supermultiprocessors.

• Topology refers to the pattern in which the processors are connected and reflects how data
will flow. Currently available designs include hypercube arrangements, networks of busses
and banyan networks.

• Finally, control describes the way the work is divided up and synchronized.


Table I. Comparison of CPU and I/O costs for an FE analysis on a VAX 11/780

Phase                                    CPU (sec)    I/O (sec)
Integration of constitutive equations    IK-28        41.00
Assembly of external forces              0.05         0.00
Assembly of viscoplastic forces          13.00        100.00
Solution                                 2.75         36.00
Overall                                  >S |N        177.00
Another important architectural distinction, and one that is most relevant to our effort in this
paper, is that which characterizes memory organization. In shared memory systems, all processors 
access the same (global) large memory system. These multiprocessors are usually coarse-grained 
because the bus to memory saturates and/or becomes prohibitively expensive above a few 
processors. On the other hand, in local memory systems each processor can access only its own 
(local) memory. Independent processors communicate by sending each other messages. It appears 
that parallel computers in this class are easier to scale to a large number of processors. 

Distinguishing only between shared and local memory systems does not give a complete picture
of the problems that one may face when programming parallel processors. Granularity and
control also have their influence. The Connection Machine (65,536 processors) and Intel's
hypercube iPSC (128 processors) are both local memory systems. However, the former is an SIMD
(single instruction multiple data streams) machine where a single program executes on the front
end and its parallel instructions are submitted to the processors. The latter is an MIMD (multiple
instruction multiple data streams) parallel processor where separate program copies execute on
separate processors. The granularity of a parallel processor, which seems to affect other
architectural elements, substantially affects the computational strategy and parallel I/O, as will be
shown.

Multiprocessors with any of the above architectures have the capability to substantially speed
up operations in scientific applications. However, I/O is still their Achilles heel. Before discussing
parallel I/O strategies and their implementations, we mention that, at the time of writing this
paper and to our best knowledge, only a few systems offer parallel I/O capabilities. These include
NCUBE at one extreme, with up to 1024 processors and their small local memories, and CRAY-2
at the other, with four vector processors and a large shared memory. Parallel disk I/O capabilities
are also available on the Connection Machine.

On NCUBE, each node (processor) has a direct connection to an I/O board through one of the
system I/O channels, so that parallel disk access is possible. Generally speaking, on local memory
multiprocessors a bundle of processors may be assigned a local disk through a dedicated I/O
channel.

On CRAY-2, multitasking I/O is possible only on a limited basis: different tasks can perform I/O
simultaneously, but only on different files. This is primarily for the following two reasons.

1. The non-deterministic nature of task execution limits I/O on the same file by different tasks.
In other words, problems may arise not only from mapping two distinct hardware processors
on to the same file, but also from mapping two logical processes on to the same file. Our
experience has shown that the latter situation complicates even sequential I/O on most
shared memory multiprocessors (ALLIANT FX/8, Encore Multimax, Sequent Balance),
mainly because of the problem of maintaining consistency in the buffer sizes between distinct 


2. The fact that parts of the support library are critical regions that are protected from 
simultaneous access, and therefore limit the parallelism that one could otherwise exploit. 

The next section presents two simple approaches for parallel disk I/O that are viable within the
limitations of the currently available hardware and system software for local memory and shared
memory multiprocessors.


4. TWO SIMPLE APPROACHES 

Most of the computational strategies recently proposed for parallel finite element computations
are based on the principle of divide and conquer: that is, divide the computing task into a number of
subtasks that are either independent or only loosely coupled, so that computations can be made
on distinct processors with little communication and sharing. For example, if using this strategy
a structure is to be analysed on $N_p$ processors, it is first
subdivided into a set of $N_p$ (or a multiple of $N_p$) balanced substructures.⁷

Depending on the size of the problem and the granularity of the parallel processor, a
substructure would contain anywhere from a single element to several thousand elements. Then
each processor is assigned the task of analysing one or several substructures. While this
approach is feasible on most parallel computers, it is especially interesting for local memory
multiprocessors. Each processor is attributed a simple data structure. Only the node geometry and
element properties associated with its assigned substructure are stored within its RAM. In
addition, formation and reduction of the stiffness matrix for that region require no interprocessor
communication. Finally, after the displacements have been found, the postprocessing of
subdomain stresses can be done concurrently.⁸


Local memory approach 

It is very natural to extend this substructuring idea to achieve parallel I/O in the finite element
analysis. For example, on local memory multiprocessors, it is tempting to imagine that, in the
same way that a processor is assigned its own memory, it could be attributed its own set of I/O
devices (I/O controller, disk drive, etc.) and its own files. Then, each processor would read/write
the data for its subdomain from its own files and through its own data base, in parallel with the
other processors. If assigning an I/O controller and/or a disk drive to each processor is impractical
and/or impossible, as is probably the case for a fine-grain system, it remains possible for a cluster
of processors. For concreteness, we overview NCUBE's I/O subsystem for a configuration with 1024
processors (Figure 2).

Figure 1. Dividing and conquering a mesh

The 1024 computational nodes can be thought of as eight groups of 128 processors each. Each
group consists of 16 clusters of eight directly connected computational nodes. (Recall that when $2^d$
processors are arranged in a hypercube pattern, each processor is directly connected to $d$ others.)
Within a cluster, each computational node has 22 direct memory access (DMA) channels. Twenty of
these are paired into 10 bi-directional communication links and are used for messages (data
transfer) to and from direct neighbours. The remaining pair of channels is bundled together with
the 127 other pairs of the same group and brought through the backplane to one of the I/O slots.
This results in what is called a system I/O channel. Clearly, an NCUBE system with 1024
computational nodes has 8 system I/O channels. Next, an I/O board is interfaced to a backplane to
serve the 128 processors organized into 16 clusters of 8 directly connected computational nodes.
Another cube with 16 nodes is connected to the other side of the I/O board. Each of these nodes
has direct access to a disk through a private controller. Hence, each of these 16 nodes can directly
serve one of the 16 clusters of computational nodes. In other words, each computational node
within a cluster of eight directly connected processors has direct access to a disk through a
dedicated node connected to the other side of an I/O board. In summary, the I/O subsystem outlined
above supports 1024 processors with 8 system I/O channels, 128 controllers and 128 disks. It has
the potential for a minimum I/O speed-up of 128.
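The counts in this description (8 groups, 16 clusters per group, 8 nodes per cluster, hence 128 disks) can be checked with a small mapping from a computational node to the I/O resources that serve it. The contiguous numbering of nodes, groups and clusters used here is a hypothetical convention chosen for illustration, not NCUBE's actual addressing scheme.

```python
from collections import Counter

N_GROUPS = 8             # one system I/O channel (and I/O board) per group
CLUSTERS_PER_GROUP = 16  # one disk and one private controller per cluster
NODES_PER_CLUSTER = 8
N_NODES = N_GROUPS * CLUSTERS_PER_GROUP * NODES_PER_CLUSTER  # 1024


def io_route(node_id):
    """Return (system I/O channel, cluster within the group, disk index)."""
    if not 0 <= node_id < N_NODES:
        raise ValueError("node_id out of range")
    channel, within_group = divmod(node_id, CLUSTERS_PER_GROUP * NODES_PER_CLUSTER)
    cluster = within_group // NODES_PER_CLUSTER
    disk = channel * CLUSTERS_PER_GROUP + cluster
    return channel, cluster, disk


if __name__ == "__main__":
    nodes_per_disk = Counter(io_route(n)[2] for n in range(N_NODES))
    print(len(nodes_per_disk), set(nodes_per_disk.values()))
```

Under this numbering every disk serves exactly eight computational nodes, so up to 128 disk transfers can proceed at once, which is where the quoted I/O speed-up potential of 128 comes from.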

Figure 2. NCUBE's I/O subsystem

C. FARHAT, E. PRAMONO AND C. FELIPPA

In the following, the words 'host' and 'DBM' denote respectively the collection of processors
serving an I/O board and a generic sequential data base manager. After a given finite element
domain is decomposed, it is grouped into regions $R_i$, $i = 1, \ldots, 128$, each containing eight
(preferably adjacent) subdomains $D_j^i$, $j = 1, \ldots, 8$. A host processor $p_i^h$ is uniquely mapped onto
each region $R_i$. It is assigned the task of handling I/O manipulations associated with computations
performed primarily in the eight subdomains within $R_i$. Basically, since $p_i^h$ is directly connected
on one side to each of the eight processors $p_j^i$ assigned to subdomains $D_j^i$, and on the other to
its dedicated disk, it can directly transfer data from the RAM of $p_j^i$, $j = 1, \ldots, 8$, to the disk and
vice versa. This is implemented as follows. Each host processor $p_i^h$ is loaded with the same program
driver, which we will call the listener, and the same copy of DBM. The main task of the listener is to
listen to the I/O requests of processors $p_j^i$, $j = 1, \ldots, 8$. These requests may be:

• receive data from $p_j^i$ and store it on disk using DBM;

• retrieve data from disk through DBM and send it to $p_j^i$;

• retrieve data from disk through DBM and send it to another host processor $p_k^h$, together with
the instruction to broadcast it to a specified number of computational nodes that are directly
connected to $p_k^h$; this particular operation implements potential exchange of data between
subdomains.

Consequently, only a small amount of RAM is required on a host processor. It corresponds to the 
storage requirements of an executable listener with its buffer for data transfer and of an executable 
code of a DBM system. Note that the size of a message is not limited by the amount of buffer 
memory available on the host processor but by the amount of memory allocated by the operating 
system for a message passing operation. Hence, a large record of data may need to be split and 
transferred via more than one message. 

On most local memory multiprocessors, a node sends a message to another node (or set of
nodes) by typically executing a ‘send’ system call with the following parameters: (a) a set of
destination nodes, (b) a process id, (c) a message type, (d) a buffered message or a pointer to the
message buffer and (e) the length of the message (usually in bytes). Similarly, a node initiates the
receipt of a message from another process by issuing a ‘receive’ system call with parameters
corresponding to the ‘send’ call. In many cases, ‘send’ and ‘receive’ cannot be coordinated. This is
the case, for example, when a host processor does not know a priori the schedule of the messages
that computational nodes will issue during the finite element analysis. In such situations, a host
processor can ‘probe’ for all pending messages of a specific type and act when a message of a given
type is available for reception. A computational node program transmits its instructions and data
to the listener via a message buffer denoted here by BUFFER, and formatted as indicated below:


BUFFER layout:

    KEY 1    KEY 2    KEY 3    INSTRUCTION    TAGGED DATA (TAG, DATA)

Example:

    4    23    1    4    STORE IN FILE ‘STIFF’


BUFFER[1] points to the location in BUFFER of the instruction stream to be processed
then delivered to DBM.

BUFFER[2] points to the location in BUFFER of the data stream to be processed then
delivered to DBM.

BUFFER[3] contains the number of continuing messages.

On most local memory multiprocessors, messages issued by a node to the same destination processor
are received in the order in which they are sent. Hence, if computational node $p_j^i$ sends to a host
processor $p_i^h$ an instruction and/or data message followed by two other messages containing the
remainder of the data, $p_i^h$ first receives the instruction followed by the first part of the data, then the
rest of the data. However, problems may occur if two different computational nodes $p_j^i$ and $p_k^i$ each
send a set of continued messages to the same host processor $p_i^h$. In this case, the host processor
might receive the messages of the two sets interleaved. To eliminate ambiguity, the logic of the listener is
implemented as follows:

• it receives a first message, identifies its type and the number of continuation messages; 

• it probes for pending continuation messages of the same type, receives and processes them 
(pending messages of a different type are queued by the operating system); 

• it listens for another starting message.
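The three listener steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the message dictionary fields (`type`, `continuations`, `data`), the helper `probe` and the in-memory queue are hypothetical stand-ins for the operating system's message queue and are not part of the original implementation. Here the message type is assumed to identify the sending node.

```python
from collections import deque


def probe(queue, msg_type):
    """Remove and return the oldest pending message of the given type, or None."""
    for i, msg in enumerate(queue):
        if msg["type"] == msg_type:
            del queue[i]
            return msg
    return None


def listener(queue):
    """Reassemble multi-message records from an interleaved arrival queue."""
    records = []
    while queue:
        first = queue.popleft()                   # step 1: a starting message
        parts = [first["data"]]
        for _ in range(first["continuations"]):   # step 2: same-type continuations
            nxt = probe(queue, first["type"])
            if nxt is None:
                # A real listener would wait here; the sketch simply stops.
                raise RuntimeError("continuation message has not arrived yet")
            parts.append(nxt["data"])
        records.append((first["type"], "".join(parts)))
    return records                                # step 3: loop back and listen again


# Two nodes, A and B, whose continued messages arrive interleaved.
arrivals = deque([
    {"type": "A", "continuations": 2, "data": "a1"},
    {"type": "B", "continuations": 1, "data": "b1"},
    {"type": "A", "continuations": 0, "data": "a2"},
    {"type": "B", "continuations": 0, "data": "b2"},
    {"type": "A", "continuations": 0, "data": "a3"},
])
records = listener(arrivals)
```

Messages of a different type stay in the queue untouched while one record is being reassembled, which mirrors the role the text assigns to the operating system.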

Next, we describe another approach, this one for multiprocessors with shared memory. 


Shared memory approach 

It is possible to simulate a local memory multiprocessor with a shared memory one, by 
partitioning the global memory into locations each fetched always by the same processor. 
Consequently, the approach presented above for parallel I/O on local memory multiprocessors 
identically applies to shared memory machines. However, we see three reasons for adopting a 
different approach on shared memory parallel processors. 

1. Mimicking a local memory system on a shared memory one defeats the purpose of sharing
information.

2. As described previously, the local memory approach ties a given processor indefinitely to the
I/O needs of a specific region of the finite element domain. We refer to this as a static
mapping of a processor onto a subdomain. A key issue in the performance of parallel processing
is load balancing. When the amount of work (computations + I/O) to be performed can be
predicted for each region of a mesh, it can be evenly distributed among the processors
through a careful partitioning of the geometrical domain and an adequate mapping of the
processors onto the resulting subdomains. When such predictions are not possible, a
dynamical load balancing algorithm is necessary for optimal performance on parallel
processors. Local mesh refinements in adaptive computations and local material property
changes in elastoplastic analyses are examples of situations where the mapping of a processor
onto a subdomain needs to be re-defined at each computational step. Note that on local
memory multiprocessors, re-mapping of the processors on the finite element domain implies
a substantial amount of data transfer between the processors, and what is gained with the
even redistribution of computations and I/O is lost with interprocessor communications. On
the other hand, the dynamical re-mapping of the processors of a shared memory system for
complex finite element computations can be achieved at almost zero overhead cost.

3. Because of the ability of a processor to reference any location in the global memory, shared
memory multiprocessors provide the programmer with a wider variety of parallel strategies
than do local memory systems. One ought to take advantage of this fact. It will be shown that
our approach for parallel I/O in finite element computations on shared memory
multiprocessors embeds our approach on local memory machines as a particular case.

Unlike the previous approach, a single executable version of a sequential DBM is stored in the
global memory of the multiprocessor. Moreover, there is no need for a listener since all processors
can directly access DBM, the I/O library and the disks. However, the core of the computational
routines needs to be slightly modified to distinguish between global variables, which are shared by
all the defined processes, and local variables, which have a single name to ease programming but a
distinct value for each process. Using parallel constructs such as those of The Force¹⁰ reduces the
nature and amount of modifications to one: that of preceding each Fortran declaration of a
variable by either the word SHARED or the word PRIVATE. While our approach for parallel I/O
on local memory machines is subdomain oriented, it is purely data oriented on shared memory
multiprocessors. In the following, we distinguish between four classes of parallel I/O requests.

1. Synchronous request with private variables [SRPV]. All processes request I/O operations
simultaneously, each with a private buffer area. Typically, this happens in an SIMD programming
style, even when the multiprocessor is of the MIMD type. For example, suppose that all of the
processes have to perform the same amount of identical computations but on distinct sets of data,
and suppose that these computations are such that out-of-core temporary storage associated with
each set of data is needed. Here, identity in the instructions calls for synchronous parallel I/O, and
independence in the data sets calls for private temporary storage.

2. Synchronous request with shared variables [SRSV]. All processes request I/O operations
simultaneously using a common buffer area. For example, consider the previous case with the
additional assumption that the nature of the computations requires shuffling of the temporary
data between processes.

3. Asynchronous request with private variables [ARPV]. A process requests I/O operations
independently of another process and with a private buffer area. These requests are identical to
those on MIMD local memory multiprocessors. For example, the entire approach described
earlier for local memory multiprocessors fits into this class of I/O requests.

4. Asynchronous request with shared variables [ARSV]. A process requests I/O operations
independently of another process using a shared buffer area.

Clearly, the four classes of I/O request described above cover all the possibilities on a shared 
memory multiprocessor. At this point we introduce the following remarks. 

1. Synchronous and asynchronous refer to the initiation of the processes and not to their 
execution. Two processes can be initiated at the same time but executed at two different 
times, for example, if one processor were tied up by a previous process. 

2. [ARSV] requires that a pointer to the location in the buffer of the starting address for storage 
and/or retrieval of data be carefully computed by its owner process, in order not to destroy 
the information by overlapping the data. 

3. The multiprocessor will take no responsibility for automatically generating synchronization. 
It is entirely the responsibility of the user to make sure that the shared data to be created by 
one process and to be read by another process are available before an [ARSV] is issued. 
Typically, one invokes an explicit synchronization instruction for that purpose. 
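Remarks 2 and 3 can be illustrated with a small Python sketch in which threads stand in for processes. Everything named here (`arsv_demo`, the list used as a shared buffer, the barrier) is a hypothetical illustration and not part of PIOM. Each writer computes its own starting offset so that no two writers overlap, and an explicit barrier guarantees that the shared data exist before they are read.

```python
import threading


def arsv_demo(chunks):
    """Write chunks into one shared buffer at disjoint offsets, then read it."""
    sizes = [len(c) for c in chunks]
    # Remark 2: each owner computes the pointer where its data start.
    offsets = [sum(sizes[:i]) for i in range(len(sizes))]
    shared = [None] * sum(sizes)
    # Remark 3: explicit synchronization, one party per writer plus the reader.
    ready = threading.Barrier(len(chunks) + 1)

    def writer(i):
        shared[offsets[i]:offsets[i] + sizes[i]] = chunks[i]
        ready.wait()

    threads = [threading.Thread(target=writer, args=(i,))
               for i in range(len(chunks))]
    for t in threads:
        t.start()
    ready.wait()              # the reader proceeds only after every writer is done
    snapshot = list(shared)
    for t in threads:
        t.join()
    return offsets, snapshot


offsets, snapshot = arsv_demo([[1, 2], [3], [4, 5, 6]])
```

Without the barrier, the reader could copy the buffer while some slots still hold `None`, which is the situation the third remark warns about.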

Next, we describe a simple parallel I/O manager, PIOM, which copes with our four defined I/O
requests. First, note that PIOM can handle [ARPV] and [ARSV] exactly as in the sequential case.
Hence, [SRPV] and [SRSV] are the requests which call for a modification of a basic sequential
I/O manager. Moreover, after PIOM recognizes that [SRPV] deals with private variables, it can
treat it exactly as [ARPV], with the difference that calling processes are responded to in parallel.
In other words, [SRPV] is treated as a set of simultaneous [ARPV]. Consequently, the treatment
of [SRSV] is PIOM's major task.

For each file related to an [SRSV], PIOM consults an I/O table. If the request is for storing
data, PIOM's logic is as follows:

(S1) it partitions the information into a number of contiguous subsets equal to the number of
calling processes, each subset containing an equal amount of data.

(S2) for each subset, it computes a pointer to the location in the shared buffer where the subset
data stream begins.

(S3) for each calling process, it creates a corresponding ‘S’ I/O process. Each ‘S’ I/O process is
assigned a subset of the data with its pointer.

(S4) it reports in the I/O table the total number of created ‘S’ I/O processes. For each ‘S’ I/O
process, it specifies the length of its assigned data and their destination on a hardware
device.

(S5) it fires the ‘S’ I/O processes. Each ‘S’ I/O process re-partitions its assigned data into a
number of records that is a multiple of the total number of available processors on the
machine, then calls DBM independently of the other ‘S’ I/O processes. The reason for the
internal partitioning will become clearer in the remarks which follow.

On the other hand, if the request is for retrieving data, PIOM's logic becomes:

(R1) it retrieves the I/O table corresponding to the file. If the number of calling processes is
equal to the number of processes registered in the table (the ‘S’ processes which originally
stored the file), the inverse logic to the ‘store’ case is followed and the data are retrieved in
parallel. If not:

(R2) for each registered ‘S’ I/O process, it partitions its subset of information into a number of
contiguous blocks of data equal to the number of calling processes, each block containing
an equal amount of data.

(R3) for each block, it computes a pointer to the location in the shared buffer where the block
data stream begins.

(R4) for each registered ‘S’ I/O process, it creates a number of ‘R’ I/O processes equal to the
number of calling processes. Each ‘R’ I/O process is assigned a block of the subset data
with its pointer.

(R5) it fires the ‘R’ I/O processes.

(R6) it follows with the next ‘S’ process to be retrieved.

Clearly, step (S5) and steps (R2) to (R6) allow a file that was written in parallel using $p$
processes to be read in parallel using $p^*$ processes, where $p^*$ is different from $p$. In this case, the
retrieval of the file is carried out in $p$ waves, each with a degree of parallelism equal to $p^*$. The overall
logic is summarized in Figure 3.
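The store and retrieve bookkeeping can be sketched as follows. This is a minimal illustration, not PIOM itself: the function names are hypothetical, data are addressed by (offset, length) pairs in the shared buffer, and the handling of lengths that do not divide evenly is an assumption, since the text only considers equal amounts of data.

```python
def partition(offset, length, parts):
    """Split [offset, offset + length) into `parts` contiguous (offset, length) pieces."""
    base, extra = divmod(length, parts)
    pieces, start = [], offset
    for i in range(parts):
        size = base + (1 if i < extra else 0)   # assumption: spread any remainder
        pieces.append((start, size))
        start += size
    return pieces


def store_plan(length, p):
    """Steps (S1)-(S2): one contiguous subset, with its pointer, per 'S' process."""
    return partition(0, length, p)


def retrieve_plan(s_table, p_star):
    """Steps (R2)-(R6): one wave per registered 'S' process, p_star blocks per wave."""
    return [partition(off, size, p_star) for off, size in s_table]


s_table = store_plan(24, 3)          # file written by p = 3 processes
waves = retrieve_plan(s_table, 2)    # file read back by p* = 2 processes
```

With these inputs the retrieval consists of three waves (one per original ‘S’ process), each containing two blocks that can be read concurrently.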

In order to illustrate the flexibility of this approach, we describe two simple examples. Example
1 illustrates the distinction between the mapping of the processors on the data during I/O and
during computations. Example 2 illustrates the ability of the approach to handle dynamical load
balancing algorithms.


Example 1. A block of the stiffness matrix is to be retrieved from disk and factored using four
processors. An [SRSV] is issued to read the block of the stiffness matrix. The partitioning of the
data by PIOM into contiguous subsets is shown in Figure 4(a). After the entire data are retrieved
in parallel, the processors are mapped onto the stiffness matrix block in an interleaved fashion
(Figure 4(b)). Next, a call for a parallel active column solver¹¹ is issued.

Figure 3. A parallel I/O manager

Figure 4(a). Mapping for data retrieval

Figure 4(b). Mapping for computations


Example 2. During coloured element-by-element computations,⁹ NEB elemental stiffness
matrices have to be read from disk and processed for computation of residuals. Here again, an
[SRSV] is issued for parallel retrieval of the data. The mapping of $N_p$ processors on the elemental
stiffnesses is initially prescribed only for the first $N_p$ elements of NEB. After that, the elements are
processed as soon as a processor becomes available. Hence, the question of which element turns
out to be non-linear and which turns out to remain linear does not affect the load balancing.
Moreover, another [SRSV] for another set of NEB elements to be read in parallel can be issued as
soon as a processor is done with its computations and while all the others are still tied up with the
last $N_p - 1$ elements to be processed.
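The scheduling idea of Example 2 can be sketched with a shared work queue, where Python threads stand in for processors. This is a hypothetical illustration only: the names `process_elements` and `work` are not from the original, and unlike the text the sketch does not prescribe the mapping of the first $N_p$ elements; every element is simply taken by whichever worker is free.

```python
import queue
import threading


def process_elements(n_elements, n_procs, work):
    """Each worker takes the next unprocessed element as soon as it is free."""
    todo = queue.Queue()
    for e in range(n_elements):
        todo.put(e)
    done = [[] for _ in range(n_procs)]   # elements handled by each worker

    def worker(p):
        while True:
            try:
                e = todo.get_nowait()
            except queue.Empty:
                return                    # nothing left to process
            work(e)                       # cost may differ from element to element
            done[p].append(e)

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


done = process_elements(10, 3, lambda e: None)
processed = sorted(e for per_worker in done for e in per_worker)
```

Which worker ends up with which element depends on run-time timing, but every element is processed exactly once, so an expensive (for example non-linear) element does not stall the others.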

5. IMPLEMENTATION ON THE CRAY-2 

The CRAY-2 supermultiprocessor is characterized by a global memory of 256 million 64-bit
words, four background processors and a clock cycle of 4.1 nanoseconds. It is the target machine
for our first experiments with parallel I/O. The four background processors can operate
independently on separate jobs or concurrently on a single problem (CRAY Research Inc. refers to
this as multitasking). Each processor can independently coordinate the data flow between the
system common memory and all the external devices across four high-speed I/O channels.

As stated in Section 3, multitasking I/O is possible on the CRAY-2 with the restriction that different
processes can simultaneously perform I/O only on separate files that are located on different disks.
The shared memory approach presented in Section 4 is slightly modified to accommodate the
CRAY-2's limitations. Any file specified by the user is automatically partitioned by PIOM into a
number of ‘sub-files’ equal to the number of I/O processes. The partitioning and the sub-file
names are transparent to the user. They are recorded in the I/O table for further I/O processing.
Three algorithms (chunking, interleaving, and interleaving with buffering) are considered for
mapping the data onto the sub-files.

The chunking algorithm is a straightforward implementation of steps (S1) and (S2) described in
Section 4. The data to be transferred are partitioned into a number of subsets of contiguous data
equal either to the number of available disks, or to the number of calling processes, whichever is
smaller. This algorithm is very fast, but has two main drawbacks:


• it may not utilize all the available processors for some I/O read requests. For example, 
consider the case where the information to be read corresponds to data that were previously 
written by PIOM on the same physical disk. 


• appending to an existing file may not be efficient. 

The reader should note that the words ‘a file’ refer to what is in the user's mind. PIOM always
splits ‘the’ file into as many sub-files as there are available disks. Appending to an existing file, and
reading from an arbitrary location in a file, are two operations which are better handled by the
interleaving algorithm. Basically, if $N_d$ denotes the number of available disks, and $D$ denotes the
data stream to be processed, this algorithm partitions $D$ into a set of segments $S_i$ of arbitrary sizes,
and assigns each segment $S_i$ to disk $\mathrm{mod}(i, N_d)$ (Figure 5).

The interleaving algorithm above requires the I/O manager to be invoked a number of times;
that number is equal to the number of segments divided by the number of disks $N_d$.
Each time the I/O manager is invoked, it conveys the information segment directly from main
memory to auxiliary storage or vice versa. Another approach consists of first buffering the
segments of a given parallel I/O process in an order that reflects their layout on their assigned disk,
then invoking the I/O manager only once to execute the parallel I/O request.
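The difference between the contiguous and the interleaved mappings can be made concrete with a short sketch. The function names are hypothetical, and segments and disks are numbered from 1 here, so the assignment rule is written as $((i - 1) \bmod N_d) + 1$; this index shift is an assumption made so that the last disk is numbered $N_d$ rather than 0.

```python
def chunking(n_segments, n_disks):
    """Contiguous subsets: each disk receives one run of consecutive segments."""
    base, extra = divmod(n_segments, n_disks)
    layout, start = {}, 1
    for d in range(1, n_disks + 1):
        size = base + (1 if d <= extra else 0)
        layout[d] = list(range(start, start + size))
        start += size
    return layout


def interleaving(n_segments, n_disks):
    """Round-robin: segment i (1-based) goes to disk ((i - 1) mod n_disks) + 1."""
    layout = {d: [] for d in range(1, n_disks + 1)}
    for i in range(1, n_segments + 1):
        layout[(i - 1) % n_disks + 1].append(i)
    return layout


chunked = chunking(12, 3)
interleaved = interleaving(12, 3)
```

With the interleaved layout, appending segment 13 simply continues the round-robin on disk 1, whereas the contiguous layout would place all appended data after the last subset.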


MAIN STORAGE:  1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12

DISK 1:  1 | 4 | 7 | 10

DISK 2:  2 | 5 | 8 | 11

DISK 3:  3 | 6 | 9 | 12


Figure 5. Interleaving data on disks

The practical implementation of the three algorithms described above is carried out with The
Force,¹⁰ a preprocessor which provides a FORTRAN style parallel programming language
utilizing a set of parallel constructs. [ARPV] and [ARSV] are implemented with regular CALL
statements to PIOM; each process executes its call to a subroutine independently of the others,
delivering a different data buffer. [SRPV] and [SRSV] are implemented with the FORCECALL
executable statement to PIOM: this construct causes all the processes to jump and execute
parallel calls to PIOM. In the latter case, the processes' ids are automatically passed to PIOM.
Performance results for the three algorithms are reported in Tables II, III and IV. Tables II and III
are associated with a segment size equal respectively to 1 sector (1024 bytes) and 1 track (65536
bytes). They compare the performance of the three algorithms for a parallel read request consisting
of retrieving a 24 Mbyte data stream using 2 CPUs. Wall-clock time, system time and user time are
reported. System time corresponds to the time elapsed in PIOM managing parallelism. User time
is associated with I/O overhead.


Table II. Performance results
Parallel read; information size = 24 Mb; segment size = 1 sector

                            Clock (sec)   System (sec)   User (sec)
Chunking                    1.222         7.777E-5       3.482E-2
Interleaving (buffering)    4.885         2.0565         2.922E-2
Interleaving                6.604         0.5916         2.523


Table III. Performance results
Parallel read; information size = 24 Mb; segment size = 1 track

                            Clock (sec)   System (sec)   User (sec)
Chunking                    1.222         7.777E-5       3.482E-2
Interleaving (buffering)    4.120         1.965          3.917E-2
Interleaving                1.159         9.30E-3        7.643E-2


Table IV. Speed-up
Chunking algorithm; information size = 200 Mb

                   1 Process   2 Processes   3 Processes
Write (sec)        21.603      12.208        10.036
Speed-up           1.0         1.77          2.15
Read (sec)         1.165       0.584         0.390
Speed-up           1.0         1.99          2.99

For a segment size equal to 1 sector, the chunking algorithm is by far the fastest. For this
example, the number of segments to be processed, which is given by the ratio information
size / segment size, is such that the interleaving algorithm has a high overhead associated with I/O
instructions, and the interleaving with buffering algorithm has a high overhead associated with
PIOM's instructions.

However, for a segment size equal to 1 track, the interleaving algorithm performs best. This is 
because for the given segment size, fewer segments need to be processed and less time is elapsed in 
I/O instructions. 
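The effect of the segment size on the number of segments can be checked with a few lines of arithmetic. The sketch assumes that 1 Mbyte means $2^{20}$ bytes, which the text does not state; the sector and track sizes are those quoted above.

```python
MBYTE = 2 ** 20  # assumption: 1 Mbyte = 2**20 bytes


def n_segments(info_bytes, segment_bytes):
    """Number of segments needed to hold the data stream (ceiling division)."""
    return -(-info_bytes // segment_bytes)


per_sector = n_segments(24 * MBYTE, 1024)    # segment size = 1 sector
per_track = n_segments(24 * MBYTE, 65536)    # segment size = 1 track
ratio = per_sector // per_track
```

Under this assumption the 24 Mbyte stream is cut into 24576 one-sector segments but only 384 one-track segments, a factor of 64 fewer segments to be handled by the interleaving algorithm.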

The above results provide the user with guidance for selecting among the three
implemented parallel algorithms.

Table IV reports the wall-clock time and measured speed-up for parallel read/write requests
using the chunking algorithm. Only three out of the four available CRAY-2 CPUs were activated
because only three different disks were available. For each case, the size of the data stream to be
processed was fixed to 200 Mbytes.

Clearly, very high speed-ups are achieved for both read/write parallel requests. Note, however, 
the pathological performance for the write case with three processors. We have not yet been able 
to justify this particular result. 
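The speed-up figures can be recomputed from the wall-clock times as a quick check. The sketch below takes the times as read from Table IV (the three-process write time being read as 10.036 s) and adds the parallel efficiency, speed-up divided by the number of processes, which is not reported in the original table.

```python
def speed_up(times):
    """Speed-up relative to the single-process time, rounded to two decimals."""
    return [round(times[0] / t, 2) for t in times]


def efficiency(times):
    """Speed-up divided by the number of processes (1, 2, 3, ...)."""
    return [round(times[0] / t / (n + 1), 2) for n, t in enumerate(times)]


write_times = [21.603, 12.208, 10.036]   # seconds, 1 to 3 processes
read_times = [1.165, 0.584, 0.390]

write_speed_up = speed_up(write_times)
read_speed_up = speed_up(read_times)
write_efficiency = efficiency(write_times)
```

The read request scales almost perfectly, while the three-process write reaches an efficiency of only about 0.72, which is the anomaly noted in the text.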


6. CONCLUSION 

Finite element analyses are known to be I/O bound. In this paper, two approaches are presented
to speed I/O manipulations through parallel processing. The first approach deals with local 
memory MIMD multiprocessors and is based on a substructuring technique. The second 
approach is dedicated to shared memory multiprocessors. It has been implemented and tested on 
a CRAY-2 system with four CPU's. The obtained performance results confirm the potential of 
parallel processing in I/O manipulations. Future work will address I/O operations on the data 
vaults of the Connection Machine (65,536 processors).

Abstract. A novel domain decomposition approach for the parallel finite
element solution of equilibrium equations is presented. The spatial domain is
partitioned into a set of totally disconnected subdomains, each assigned to an
individual processor. Lagrange multipliers are introduced to enforce compatibility
at the interface nodes. In the static case, each floating subdomain induces a
local singularity that is resolved in two phases. First, the rigid body modes are
eliminated in parallel from each local problem and a direct scheme is applied
concurrently to all subdomains in order to recover each partial local solution. Next,
the contributions of these modes are related to the Lagrange multipliers through
an orthogonality condition. A parallel conjugate projected gradient algorithm is
developed for the solution of the coupled system of local rigid modes components
and Lagrange multipliers, which completes the solution of the problem. When
implemented on local memory multiprocessors, this proposed method of tearing
and interconnecting requires less interprocessor communication than the classical
method of substructuring. It is also suitable for parallel/vector computers
with shared memory. Moreover, unlike parallel direct solvers, it exhibits a degree
of parallelism that is not limited by the bandwidth of the finite element system
of equations.

1. Introduction


A number of methods based on domain decomposition procedures have been 
proposed in recent years for the parallel solution of both static and dynamic 
finite element equations of equilibrium. Most of these methods are derived from 
the popular substructuring technique. Typically, the finite element domain is 
decomposed into a set of subdomains and each of these is assigned to an individual 
processor. The solution of the local problems is trivially parallelized and usually 
a direct method is preferred for this purpose. Parallel implementations of both a 
direct (Farhat, Wilson [1]) and an iterative (Nour-Omid, Raefsky and Lyzenga [2]) 
solution of the resulting interface problem have been reported in the literature. 
A number of more original approaches have also been spurred by the advent of 
new parallel processors. Ortiz and Nour-Omid [3] have developed a family of 
unconditionally stable concurrent procedures for transient finite element analysis 
and Farhat [4] has designed a multigrid-like algorithm for the massively parallel 
finite element solution of static problems. Both of these developments relate to 
the divide and conquer paradigm but depart from classical substructuring. 

In this paper, we present a parallel finite element computational method for
the solution of static problems that is also a departure from the classical method
of substructures. The unique feature of the proposed procedure is that it
requires fewer interprocessor communications than traditional domain decomposition
algorithms, while still offering the same amount of parallelism. Roux [5,
6] has presented an early version of this work that is limited to a very special
class of problems where a finite element domain can be partitioned into a set
of disconnected but non-floating subdomains. Here, we generalize the method
to arbitrary finite element problems and arbitrary mesh partitions. We denote
the resulting computational strategy by “finite element tearing and interconnecting”
because of its resemblance to the very early work of Kron [7] on tearing
methods for electric circuit models. In Section 2, we partition the finite element
domain into a set of totally disconnected subdomains and derive a computational
strategy from a hybrid variational principle where the inter-subdomain continuity
constraint is removed by the introduction of a Lagrange multiplier. An arbitrary
mesh partition typically contains a set of floating subdomains which induce local
singularities. The handling of these singularities is treated in Section 3. First,
the rigid body modes are eliminated in parallel from each local problem and a
direct scheme is applied concurrently to all subdomains in order to recover each
partial local solution. Next, the contributions of these modes are related to the
Lagrange multipliers through an orthogonality condition. A parallel conjugate
projected gradient algorithm is developed in Section 4 for the solution of the
coupled system of local rigid modes components and Lagrange multipliers, which
completes the solution of the problem. Section 5 deals with the preconditioning of
the interface problem in order to speed up the recovery of the Lagrange multipliers.
Section 6 emphasizes the parallel characteristics of the proposed method and
Section 7 contrasts it with the method of substructures. Section 8 discusses some
important issues related to the partitioning of a given finite element mesh. Finally,
Section 9 illustrates the method with structural examples on the distributed
memory hypercube iPSC (32 processors) and the shared memory parallel/vector
CRAY-2 system (4 processors), and Section 10 concludes the paper.


2. Finite element tearing and interconnecting 

Here we present a domain decomposition based algorithm associated with a hybrid
formulation for the parallel finite element solution of the linear elastostatic
problem. However, the method is equally applicable to the finite element solution
of any self-adjoint elliptic partial differential equation. For the sake of clarity,
we consider first the case of two subdomains, then generalize the method for an
arbitrary number of subdomains.


The variational form of the three-dimensional boundary-value problem to be
solved goes as follows. Given $f$ and $h$, find the displacement function $u$ which is
a stationary point of the energy functional:

$$
J(v) = \frac{1}{2} a(v, v) - (v, f) - (v, h)_{\Gamma}
$$

where

$$
\begin{aligned}
a(v, w) &= \int_{\Omega} v_{(i,j)} \, c_{ijkl} \, w_{(k,l)} \, d\Omega \\
(v, f) &= \int_{\Omega} v_i f_i \, d\Omega \\
(v, h)_{\Gamma} &= \int_{\Gamma_h} v_i h_i \, d\Gamma
\end{aligned}
\qquad (1)
$$

In the above, the indices $i, j, k$ take the value 1 to 3, $v_{(i,j)} = (v_{i,j} + v_{j,i})/2$ and
$v_{i,j}$ denotes the partial derivative of the $i$-th component of $v$ with respect to
the $j$-th spatial variable, $c_{ijkl}$ are the elastic coefficients, $\Omega$ denotes the volume
of the elastostatic body, $\Gamma$ its piecewise smooth boundary, and $\Gamma_h$ the piece of $\Gamma$
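where the tractions $h_i$ are prescribed.

If $\Omega$ is subdivided into two regions $\Omega_1$ and $\Omega_2$ (fig. 1), solving the above
elastostatic problem is equivalent to finding the two displacement functions $u_1$
and $u_2$ which are stationary points of the energy functionals:

$$
\begin{aligned}
J_1(v_1) &= \frac{1}{2} a(v_1, v_1)_{\Omega_1} - (v_1, f)_{\Omega_1} - (v_1, h)_{\Gamma_1} \\
J_2(v_2) &= \frac{1}{2} a(v_2, v_2)_{\Omega_2} - (v_2, f)_{\Omega_2} - (v_2, h)_{\Gamma_2}
\end{aligned}
$$

where

$$
\begin{aligned}
a(v_1, w_1)_{\Omega_1} &= \int_{\Omega_1} v_{1(i,j)} \, c_{ijkl} \, w_{1(k,l)} \, d\Omega \\
a(v_2, w_2)_{\Omega_2} &= \int_{\Omega_2} v_{2(i,j)} \, c_{ijkl} \, w_{2(k,l)} \, d\Omega \\
(v_1, f)_{\Omega_1} &= \int_{\Omega_1} v_{1i} f_i \, d\Omega \\
(v_2, f)_{\Omega_2} &= \int_{\Omega_2} v_{2i} f_i \, d\Omega \\
(v_1, h)_{\Gamma_1} &= \int_{\Gamma_{h_1}} v_{1i} h_i \, d\Gamma \\
(v_2, h)_{\Gamma_2} &= \int_{\Gamma_{h_2}} v_{2i} h_i \, d\Gamma
\end{aligned}
\qquad (2)
$$

and which satisfy on the interface boundary $\Gamma_I$ the continuity constraint: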


$$
u_1 = u_2 \quad \text{on } \Gamma_I \qquad (3)
$$

FIG. 1 Decomposition in two subdomains

Solving the two above variational problems (2) with the subsidiary continuity
condition (3) is equivalent to finding the saddle point of the Lagrangian:

$$
J^*(v_1, v_2, \mu) = J_1(v_1) + J_2(v_2) + (v_1 - v_2, \mu)_{\Gamma_I}
$$

where

$$
(v_1 - v_2, \mu)_{\Gamma_I} = \int_{\Gamma_I} \mu \, (v_1 - v_2) \, d\Gamma \qquad (4)
$$

that is, finding the two displacement fields $u_1$ and $u_2$ and the Lagrange
multiplier $\lambda$ which satisfy:

$$
J^*(u_1, u_2, \mu) \le J^*(u_1, u_2, \lambda) \le J^*(v_1, v_2, \lambda) \qquad (5)
$$

for any admissible $v_1$, $v_2$ and $\mu$. Clearly, the left inequality in (5) implies that
$(u_1 - u_2, \mu)_{\Gamma_I} \le (u_1 - u_2, \lambda)_{\Gamma_I}$, which imposes that $(u_1 - u_2, \mu)_{\Gamma_I} = 0$ for any
admissible $\mu$ and therefore $u_1 = u_2$ on $\Gamma_I$. The right inequality in (5) imposes
that $J_1(u_1) + J_2(u_2) \le J_1(v_1) + J_2(v_2)$ for any pair of admissible functions $(v_1, v_2)$.
This implies that among all admissible pairs $(v_1, v_2)$ which satisfy the continuity
condition (3), the pair $(u_1, u_2)$ minimizes the sum of the energy functionals $J_1$
and $J_2$ defined respectively on $\Omega_1$ and $\Omega_2$. Therefore, $u_1$ and $u_2$ are the restrictions
of the solution $u$ of the non-partitioned problem (1) to $\Omega_1$ and $\Omega_2$ respectively.
Indeed, equations (4) and (5) correspond to a hybrid variational principle where
the inter-subdomain continuity constraint (3) is removed by the introduction of
a Lagrange multiplier (see, for example, Pian [8]).

If now the displacement fields $u_1$ and $u_2$ are expressed by suitable shape
functions as:

$$
u_1 = \mathbf{N} \mathbf{u}_1 \quad \text{and} \quad u_2 = \mathbf{N} \mathbf{u}_2
$$

and the continuity equation is enforced for the discrete problem, a standard
Galerkin procedure transforms the hybrid variational principle (4) into the
following algebraic system:

$$
\begin{aligned}
\mathbf{K}_1 \mathbf{u}_1 &= \mathbf{f}_1 + \mathbf{B}_1^T \boldsymbol{\lambda} \\
\mathbf{K}_2 \mathbf{u}_2 &= \mathbf{f}_2 - \mathbf{B}_2^T \boldsymbol{\lambda} \\
\mathbf{B}_1 \mathbf{u}_1 &= \mathbf{B}_2 \mathbf{u}_2
\end{aligned}
\qquad (6)
$$

where $\mathbf{K}_j$, $\mathbf{u}_j$ and $\mathbf{f}_j$, $j = 1, 2$, are respectively the stiffness matrix, the
displacement vector, and the prescribed force vector associated with the finite element
discretization of $\Omega_j$. The vector of Lagrange multipliers $\boldsymbol{\lambda}$ represents the
interaction forces between the two subdomains $\Omega_1$ and $\Omega_2$ along their common boundary
$\Gamma_I$. Within each subdomain $\Omega_j$, we denote the number of interior nodal unknowns
by $n_j^s$ and the number of interface nodal unknowns by $n_j^I$. The total number of
interface nodal unknowns is denoted by $n_I$. Note that $n_I = n_1^I = n_2^I$ in the
particular case of two subdomains. If the interior degrees of freedom are numbered first
and the interface ones are numbered last, each of the two connectivity matrices
$\mathbf{B}_1$ and $\mathbf{B}_2$ takes the form:

$$
\mathbf{B}_j = [\, \mathbf{O}_j \;\; \mathbf{I}_j \,] \qquad j = 1, 2
$$

where $\mathbf{O}_j$ is an $n_j^I \times n_j^s$ null matrix and $\mathbf{I}_j$ is the $n_j^I \times n_j^I$ identity matrix. The
vector of Lagrange multipliers $\boldsymbol{\lambda}$ is $n_I$ long.

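If both $\mathbf{K}_1$ and $\mathbf{K}_2$ are non-singular, equations (6) can be written as:

$$
\begin{aligned}
(\mathbf{B}_1 \mathbf{K}_1^{-1} \mathbf{B}_1^T + \mathbf{B}_2 \mathbf{K}_2^{-1} \mathbf{B}_2^T) \boldsymbol{\lambda} &= \mathbf{B}_2 \mathbf{K}_2^{-1} \mathbf{f}_2 - \mathbf{B}_1 \mathbf{K}_1^{-1} \mathbf{f}_1 \\
\mathbf{u}_1 &= \mathbf{K}_1^{-1} (\mathbf{f}_1 + \mathbf{B}_1^T \boldsymbol{\lambda}) \\
\mathbf{u}_2 &= \mathbf{K}_2^{-1} (\mathbf{f}_2 - \mathbf{B}_2^T \boldsymbol{\lambda})
\end{aligned}
\qquad (7)
$$

and the solution of (6) is obtained by solving the first of equations (7) for the
Lagrange multipliers $\boldsymbol{\lambda}$, then substituting these into the remaining equations of (7) and
back-solving for $\mathbf{u}_1$ and $\mathbf{u}_2$.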
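The two-subdomain procedure can be tried on a very small problem. The sketch below is an illustration under stated assumptions, not the authors' implementation: each subdomain has one interior and one interface unknown (so $\mathbf{B}_j = [0 \;\; 1]$ and the interface problem is scalar), the helper names are hypothetical, and the test case (four unit springs in series, both ends fixed, unit load at the middle node, with the whole load assigned to the first subdomain) is chosen only because its exact solution is easy to verify by hand.

```python
def solve2(K, b):
    """Solve the 2x2 system K x = b by Cramer's rule."""
    (k11, k12), (k21, k22) = K
    det = k11 * k22 - k12 * k21
    return [(k22 * b[0] - k12 * b[1]) / det,
            (k11 * b[1] - k21 * b[0]) / det]


def tear_and_interconnect(K1, f1, K2, f2):
    """Two-subdomain solve following (7), with B_j = [0 1] (interface dof last)."""
    # Interface operator B1 K1^-1 B1^T + B2 K2^-1 B2^T (a scalar here).
    F = solve2(K1, [0.0, 1.0])[1] + solve2(K2, [0.0, 1.0])[1]
    # Right-hand side B2 K2^-1 f2 - B1 K1^-1 f1.
    d = solve2(K2, f2)[1] - solve2(K1, f1)[1]
    lam = d / F
    # Back-substitution: u1 = K1^-1 (f1 + B1^T lam), u2 = K2^-1 (f2 - B2^T lam).
    u1 = solve2(K1, [f1[0], f1[1] + lam])
    u2 = solve2(K2, [f2[0], f2[1] - lam])
    return lam, u1, u2


# Each subdomain: fixed end - spring - interior node - spring - interface node.
K = [[2.0, -1.0], [-1.0, 1.0]]
lam, u1, u2 = tear_and_interconnect(K, [0.0, 1.0], K, [0.0, 0.0])
```

For this example the multiplier is $-0.5$, meaning half of the load is transferred to the second subdomain, and both subdomains return displacements of 0.5 at the interior node and 1.0 at the interface, which agrees with the solution of the assembled four-spring chain.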


For an arbitrary number of subdomains $\Omega_j$, the method goes as follows.
First, the finite element mesh is “torn” into a set of totally disconnected meshes
(fig. 2).

FIG. 2 Finite Element Tearing

For each mesh, the stiffness matrix $K_j$ and the vector of prescribed forces $f_j$ are
formed. Next, for each $\Omega_j$, a set of boolean symbolic matrices $B_j^k$ are set up to
interconnect the mesh of $\Omega_j$ with those of its neighbors $\Omega_k$. In general, $B_j^k$ is
$n_I \times (n_j^s + n_j^I)$ and has the following pattern:

$$
B_j^k = \begin{bmatrix} O_1(j,k) \\ C_j^k \\ O_2(j,k) \end{bmatrix}
$$

where $O_1(j,k)$ is an $m_1(j,k) \times (n_j^s + n_j^I)$ zero matrix, $O_2(j,k)$ is another
$m_2(j,k) \times (n_j^s + n_j^I)$ zero matrix, $C_j^k$ is an $m_c(j,k) \times (n_j^s + n_j^I)$ connectivity
matrix, $m_c(j,k)$ is the number of Lagrange multipliers that interconnect $\Omega_j$ with
its neighbor $\Omega_k$, and $m_1(j,k)$ and $m_2(j,k)$ are two non-negative integers which
satisfy $m_1(j,k) + m_c(j,k) + m_2(j,k) = n_I$. The connectivity matrix $C_j^k$ can be
written as:

$$ C_j^k = [\, O_3(j,k) \;\; I_j^k \;\; O_4(j,k) \,] $$

where $O_3(j,k)$ is an $m_c(j,k) \times m_3(j,k)$ zero matrix, $I_j^k$ is the $m_c(j,k) \times m_c(j,k)$
identity matrix, $O_4(j,k)$ is another $m_c(j,k) \times m_4(j,k)$ zero matrix, and $m_3(j,k)$
and $m_4(j,k)$ are two non-negative integers which satisfy
$m_3(j,k) + m_c(j,k) + m_4(j,k) = n_j^s + n_j^I$. If $a_j$ and $N_s$ denote respectively the number of
subdomains $\Omega_k$ that are adjacent to $\Omega_j$ and the total number of subdomains, the
finite element variational interpretation of the saddle-point problem (4)
generates the following algebraic system:


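$$
\begin{aligned}
K_j u_j &= f_j + \sum_{k=1}^{a_j} B_j^{k\,T} \lambda, & j &= 1, \ldots, N_s \\
B_j^k u_j &= B_k^j u_k, & j &= 1, \ldots, N_s \ \text{and}\ \Omega_k\ \text{connected to}\ \Omega_j
\end{aligned}
\qquad (8)
$$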


If $K_j$ is non-singular for all $j = 1, \ldots, N_s$, the solution procedure (7) can be
extended to the case of an arbitrary number of subdomains. However, the finite
element tearing process described in this section may produce some “floating”
subdomains $\Omega_f$ which are characterized by a singular stiffness matrix $K_f$. When
this happens, the above solution algorithm (7) breaks down and a special
computational strategy is required to handle the local singularities.

We refer to the computational procedure presented herein as the method
of finite element tearing and interconnecting because of its resemblance to
Kron’s tearing method [7] for electric circuit models. We also note that the
utility of Lagrange multipliers specifically for domain decomposition has
been previously recognized by other investigators (Dinh, Glowinski and Periaux
[9], Dorr [10]).


3. Handling local singularities 

Again, we focus on the two-subdomain tearing. The extrapolation to $N_s > 2$ is
straightforward. For example, suppose that $\Omega$ corresponds to a cantilever beam
and that $\Omega_1$ and $\Omega_2$ are the result of a vertical partitioning (fig. 3).



FIG. 3 Decomposition resulting in a singular subdomain 

In this case, $K_1$ is positive definite and $K_2$ is positive semi-definite since no
boundary condition is specified over $\Omega_2$. Therefore, the second of equations (6):

$$ K_2 u_2 = f_2 - B_2^T \lambda \qquad (9) $$

requires special attention. If the singular system (9) is consistent, a pseudo-inverse
of $K_2$ can be found, that is, a matrix $K_2^+$ which verifies $K_2 K_2^+ K_2 = K_2$, and
the general solution of (9) is given by

$$ u_2 = K_2^+ (f_2 - B_2^T \lambda) + R_2 \alpha \qquad (10) $$

where $R_2$ is an $(n_2^s + n_2^I) \times n_2^r$ rectangular matrix whose columns form a basis of the
null space of $K_2$, and $\alpha$ is a vector of length $n_2^r$. Physically, $R_2$ represents the rigid
body modes of $\Omega_2$ and $\alpha$ specifies a linear combination of these. Consequently,
we have $n_2^r \le 6$ for three-dimensional problems, and $n_2^r \le 3$ for two-dimensional
problems. Substituting (10) into (7) leads to:

$$
\begin{aligned}
(B_1 K_1^{-1} B_1^T + B_2 K_2^+ B_2^T)\,\lambda &= -B_1 K_1^{-1} f_1 + B_2 (K_2^+ f_2 + R_2 \alpha) \\
u_1 &= K_1^{-1} (f_1 + B_1^T \lambda) \\
u_2 &= K_2^+ (f_2 - B_2^T \lambda) + R_2 \alpha
\end{aligned}
\qquad (11)
$$


It should be noted that:

1. because $B_j$ is a boolean operator, the result of its application to a matrix or
vector quantity should be interpreted as an extraction process rather than a
matrix-matrix or matrix-vector product. For example, $B_2 R_2$ is the
restriction of the local rigid modes $R_2$ of $\Omega_2$ to the interface unknowns. In the
sequel we adopt the notation:

$$ R_2^I = B_2 R_2 $$

2. the pseudo-inverse $K_2^+$ does not need to be explicitly computed. For a given
input vector $v$, the output vector $K_2^+ v$ and the rigid modes $R_2$ can be
obtained at almost the same computational cost as the response vector $K_1^{-1} v$,
where $K_1$ is non-singular (see appendix A).

3. system (11) is under-determined. Both $\lambda$ and $\alpha$ need to be determined before
$u_1$ and $u_2$ can be found, but only three equations are available so far.
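The extraction reading of the boolean operator in note 1 can be sketched in a few lines. The sizes and the random stand-in for the rigid modes are arbitrary assumptions; the point is only that applying $B_j$ is an index selection and applying $B_j^T$ is a scatter.

```python
import numpy as np

n_s, n_I = 5, 3                                  # interior / interface unknowns (arbitrary)
B = np.hstack([np.zeros((n_I, n_s)), np.eye(n_I)])   # B_j = [O  I]
interface = np.arange(n_s, n_s + n_I)            # indices of the interface unknowns

rng = np.random.default_rng(0)
R = rng.standard_normal((n_s + n_I, 2))          # stand-in for the rigid modes R_2
v = rng.standard_normal(n_I)                     # stand-in for a multiplier vector

R_I_product = B @ R                              # written as a matrix-matrix product
R_I_extract = R[interface]                       # same result by pure indexing

scattered = np.zeros(n_s + n_I)
scattered[interface] = v                         # B^T v written as a scatter
```

In an implementation along these lines, $B_j$ would never be stored as a matrix; an index array such as `interface` carries the same information.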

Since $K_2$ is symmetric, the singular equation (9) admits at least one solution
if and only if the right hand side $(f_2 - B_2^T \lambda)$ has no component in the null space
of $K_2$. This can be expressed as:

$$ R_2^T (f_2 - B_2^T \lambda) = 0 \qquad (12) $$

The above orthogonality condition provides the missing equation for the complete
solution of (11). Combining (11) and (12) yields, after some algebraic
manipulations:


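$$
\begin{aligned}
\begin{bmatrix} F_I & -R_2^I \\ -R_2^{I\,T} & O \end{bmatrix}
\begin{bmatrix} \lambda \\ \alpha \end{bmatrix}
&=
\begin{bmatrix} B_2 K_2^+ f_2 - B_1 K_1^{-1} f_1 \\ -R_2^T f_2 \end{bmatrix} \\
u_1 &= K_1^{-1} (f_1 + B_1^T \lambda) \\
u_2 &= K_2^+ (f_2 - B_2^T \lambda) + R_2 \alpha
\end{aligned}
\qquad (13)
$$

where

$$ F_I = B_1 K_1^{-1} B_1^T + B_2 K_2^+ B_2^T $$

Clearly, $F_I$ is symmetric positive definite and $R_2^I$ has full column rank. Therefore,
the system of equations in $(\lambda, \alpha)$ is symmetric and non-singular. It admits a
unique solution $(\lambda, \alpha)$ which uniquely determines $u_1$ and $u_2$.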
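The following sketch assembles and solves system (13) directly for a toy cantilever bar of unit springs, torn so that the second subdomain is floating. It is an illustration under stated assumptions, not the solution strategy advocated in the text: the helper `chain`, the loads and the sizes are invented for the example, the interface is numbered first in subdomain 2 (so $B_2$ selects the first unknown), and the pseudo-inverse is formed explicitly with `numpy.linalg.pinv` purely for brevity.

```python
import numpy as np

def chain(n, clamped):
    """Stiffness of a chain of unit springs with n dofs.
    clamped=True : dof 0 is tied to a wall by one more spring (non-singular).
    clamped=False: free-free chain (singular, one rigid body mode)."""
    K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    K[-1, -1] = 1.0
    if not clamped:
        K[0, 0] = 1.0
    return K

m = 4
K1, K2 = chain(m, True), chain(m, False)
f1 = np.array([1.0, 1.0, 1.0, 0.5])
f2 = np.array([0.5, 1.0, 1.0, 1.0])
B1 = np.array([[0.0, 0.0, 0.0, 1.0]])    # interface = last dof of subdomain 1
B2 = np.array([[1.0, 0.0, 0.0, 0.0]])    # interface = first dof of subdomain 2
R2 = np.ones((m, 1))                     # rigid translation spans null(K2)
K2p = np.linalg.pinv(K2, rcond=1e-10)    # explicit pseudo-inverse, for brevity only

F_I = B1 @ np.linalg.solve(K1, B1.T) + B2 @ K2p @ B2.T
R2I = B2 @ R2
L = np.block([[F_I, -R2I], [-R2I.T, np.zeros((1, 1))]])
rhs = np.concatenate([B2 @ K2p @ f2 - B1 @ np.linalg.solve(K1, f1), -R2.T @ f2])
lam, alpha = np.split(np.linalg.solve(L, rhs), [1])

u1 = np.linalg.solve(K1, f1 + B1.T @ lam)
u2 = K2p @ (f2 - B2.T @ lam) + R2 @ alpha

# Reference: the assembled cantilever bar with 2m - 1 dofs.
K = chain(2 * m - 1, True)
f = np.concatenate([f1[:-1], [f1[-1] + f2[0]], f2[1:]])
u_ref = np.linalg.solve(K, f)
```

In this example the second block row of (13) forces the multiplier to balance the total load applied to the floating subdomain, which is the orthogonality condition (12) for a single translational rigid mode.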

It is important to note that since $n_2^r \le 6$, systems (13) and (7) have almost
the same size. For an arbitrary number of subdomains $N_s$ of which $N_f$ are
floating, the additional number of equations introduced by the handling of local
singularities is bounded by $6 N_f$. For large-scale problems and relatively coarse
mesh partitions, this number is a very small fraction of the size of the global
system. On the other hand, if a given tearing process does not result in any
floating subdomain, $\alpha$ is zero and the systems of equations (13) and (7) are
identical.

Next, we present a numerical algorithm for the solution of (13). 


4. A preconditioned conjugate projected gradient algorithm 
Here we focus on the solution of the non-singular system of equations:

$$
\begin{bmatrix} F_I & -R_2^I \\ -R_2^{I\,T} & O \end{bmatrix}
\begin{bmatrix} \lambda \\ \alpha \end{bmatrix}
=
\begin{bmatrix} B_2 K_2^+ f_2 - B_1 K_1^{-1} f_1 \\ -R_2^T f_2 \end{bmatrix}
\qquad (14)
$$

where

$$ F_I = B_1 K_1^{-1} B_1^T + B_2 K_2^+ B_2^T $$

We seek an efficient solution algorithm that does not require the explicit assembly
of $F_I$.


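The solution to the above problem can be expressed as:

$$
\begin{aligned}
\lambda &= H (B_2 K_2^+ f_2 - B_1 K_1^{-1} f_1) + T R_2^T f_2 \\
\alpha &= T^T (B_1 K_1^{-1} f_1 - B_2 K_2^+ f_2) - U R_2^T f_2
\end{aligned}
$$

where

$$
\begin{aligned}
H &= F_I^{-1} - F_I^{-1} R_2^I (R_2^{I\,T} F_I^{-1} R_2^I)^{-1} R_2^{I\,T} F_I^{-1} \\
T &= F_I^{-1} R_2^I (R_2^{I\,T} F_I^{-1} R_2^I)^{-1} \\
U &= -(R_2^{I\,T} F_I^{-1} R_2^I)^{-1}
\end{aligned}
\qquad (15)
$$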


As written in (15), this solution procedure is not recommended because it
requires either the evaluation of the inverse of $F_I$, or the nested solutions of two
linear systems involving $F_I$ and $R_2^{I\,T} F_I^{-1} R_2^I$. It is noted by Fletcher [11] that if
two matrices $S$ and $Z$ are computed such that:

$$
\begin{aligned}
S^T R_2^I &= I \\
Z^T R_2^I &= O
\end{aligned}
\qquad (16)
$$

an alternative representation of the solution to (14) is given by:

$$
\begin{aligned}
\lambda &= H (B_2 K_2^+ f_2 - B_1 K_1^{-1} f_1) + T R_2^T f_2 \\
\alpha &= T^T (B_1 K_1^{-1} f_1 - B_2 K_2^+ f_2) - U R_2^T f_2
\end{aligned}
$$

where

$$
\begin{aligned}
H &= Z (Z^T F_I Z)^{-1} Z^T \\
T &= S - H F_I S \\
U &= S^T F_I H F_I S - S^T F_I S
\end{aligned}
\qquad (17)
$$

which does not require the explicit assembly of $F_I$ if a suitable iterative scheme is
chosen for solving all the temporary systems involving the quantity $(Z^T F_I Z)^{-1}$.
Still, the above solution procedure is not feasible because it requires the
computation of $S$ and $Z$, typically via a QR factorization of some matrix involving
$R_2^I$ [11], and the iterative solution of too many temporary systems before $\lambda$ and
$\alpha$ can be obtained.


Clearly, the nature of $F_I$ rules out any solution technique for (14) that requires
this submatrix explicitly. This implies that a direct method
or an iterative method of the SOR type cannot be used. The only efficient method
of solving (14) in the general sparse case is that of conjugate gradients, because
once $K_1$ and $K_2$ have been factorized, matrix-vector products of the form $F_I v$
can be performed very efficiently using only forward and backward substitutions.
Unfortunately, the Lagrangian matrix:

$$
L = \begin{bmatrix} F_I & -R_2^I \\ -R_2^{I\,T} & O \end{bmatrix}
$$

is indefinite so that a straightforward conjugate gradient algorithm cannot be
directly applied to the solution of (14). However, the conjugate gradient iteration
with the projected gradient (see, for example, Gill and Murray [12]) can be used
to obtain the sought-after solution. In order to introduce the latter solution
algorithm, we first note that solving (14) is equivalent to solving the
equality-constrained problem:


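$$
\begin{aligned}
\text{minimize} \quad & \Phi(\lambda) = \tfrac{1}{2} \lambda^T F_I \lambda + (B_1 K_1^{-1} f_1 - B_2 K_2^+ f_2)^T \lambda \\
\text{subject to} \quad & R_2^{I\,T} \lambda = R_2^T f_2
\end{aligned}
\qquad (18)
$$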


Since $F_I$ is symmetric positive definite, a conjugate gradient algorithm is most
suitable for computing the unique solution to the unconstrained problem.
Therefore, this algorithm will converge to the solution to (18) if and only if it can be
modified so that the constraint $R_2^{I\,T} \lambda = R_2^T f_2$ is satisfied at each iteration. This
can be achieved by projecting all the search directions onto the null space of $R_2^{I\,T}$.
The result is a conjugate gradient algorithm with the projected gradient. It
is of the form:

Initialize

    Pick $\lambda^{(0)}$ such that $R_2^{I\,T} \lambda^{(0)} = R_2^T f_2$
    $r^{(0)} = (B_2 K_2^+ f_2 - B_1 K_1^{-1} f_1) - F_I \lambda^{(0)}$                          (19)

Iterate $k = 1, 2, \ldots$ until convergence

    $\beta^{(k)} = r^{(k-1)T} r^{(k-1)} / r^{(k-2)T} r^{(k-2)}$      ($\beta^{(1)} = 0$)
    $s^{(k)} = r^{(k-1)} + \beta^{(k)} s^{(k-1)}$                    ($s^{(1)} = r^{(0)}$)
    $s^{(k)} = [I - R_2^I (R_2^{I\,T} R_2^I)^{-1} R_2^{I\,T}]\, s^{(k)}$
    $\gamma^{(k)} = r^{(k-1)T} r^{(k-1)} / s^{(k)T} F_I s^{(k)}$
    $\lambda^{(k)} = \lambda^{(k-1)} + \gamma^{(k)} s^{(k)}$
    $r^{(k)} = r^{(k-1)} - \gamma^{(k)} F_I s^{(k)}$                                     (20)


A fast scheme for finding a starting $\lambda^{(0)}$ which satisfies the constraint
$R_2^{I\,T} \lambda^{(0)} = R_2^T f_2$ is given in appendix B. Clearly, $R_2^{I\,T} s^{(k)} = 0$ for all $k \ge 1$.
Therefore, $R_2^{I\,T} \lambda^{(k)} = R_2^{I\,T} \lambda^{(0)}$, which indicates that the approximate solution
$\lambda^{(k)}$ satisfies the linear equality constraint of problem (14) at each iteration $k$.
It is also important to note that within each iteration, only one projection is
performed. This projection is relatively inexpensive since the only implicit
computations that are involved are associated with the matrix $R_2^{I\,T} R_2^I$, which is at
most $6 \times 6$. This matrix is factored once, before the first iteration begins. Except
for this small overhead, algorithm (20) above has the same computational cost as
the regular conjugate gradient method.
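A minimal sketch of a projected conjugate gradient iteration is given below. It is not a line-by-line transcription of (19)-(20): it uses a common variant in which the residual is projected before the search direction is built, which is ordinary conjugate gradients restricted to the null space of the constraint matrix. The function name `projected_cg`, the tolerances and the random test data are assumptions made for illustration; the operator is supplied as a function so that it never needs to be assembled.

```python
import numpy as np

def projected_cg(apply_F, d, R, lam0, tol=1e-10, max_iter=500):
    """Minimize 1/2 lam^T F lam - d^T lam subject to R^T lam = R^T lam0."""
    RtR = R.T @ R                      # small matrix, formed once

    def project(v):                    # P v = [I - R (R^T R)^{-1} R^T] v
        return v - R @ np.linalg.solve(RtR, R.T @ v)

    lam = lam0.copy()
    r = d - apply_F(lam)               # residual of the unconstrained problem
    w = project(r)                     # projected residual
    s = w.copy()                       # first search direction
    ww = w @ w
    for _ in range(max_iter):
        if np.sqrt(ww) <= tol:
            break
        Fs = apply_F(s)
        gamma = ww / (s @ Fs)
        lam = lam + gamma * s          # s lies in null(R^T): constraint is kept
        r = r - gamma * Fs
        w = project(r)
        ww_new = w @ w
        s = w + (ww_new / ww) * s
        ww = ww_new
    return lam

# Synthetic test data (not from the paper).
rng = np.random.default_rng(0)
n, q = 12, 3
A = rng.standard_normal((n, n))
F = A @ A.T + n * np.eye(n)            # symmetric positive definite
R = rng.standard_normal((n, q))        # full column rank
d = rng.standard_normal(n)
e = rng.standard_normal(q)             # constraint right hand side

lam0 = R @ np.linalg.solve(R.T @ R, e)             # satisfies R^T lam0 = e
lam = projected_cg(lambda v: F @ v, d, R, lam0)
alpha = np.linalg.solve(R.T @ R, R.T @ (F @ lam - d))

# Reference: direct solve of the indefinite Lagrangian system.
L = np.block([[F, -R], [-R.T, np.zeros((q, q))]])
ref = np.linalg.solve(L, np.concatenate([d, -e]))
```

Here `d` stands for the right hand side $B_2 K_2^+ f_2 - B_1 K_1^{-1} f_1$ and `e` for $R_2^T f_2$; in the method described above, `apply_F` would be realized with forward and backward substitutions on the factored subdomain matrices.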

After $\lambda$ is found, the rigid body mode coefficients are computed as:

$$ \alpha = (R_2^{I\,T} R_2^I)^{-1} R_2^{I\,T} (F_I \lambda - B_2 K_2^+ f_2 + B_1 K_1^{-1} f_1) $$

For an arbitrary number of subdomains $N_s$ of which $N_f$ are floating, the
equality constraint is:

$$
R^{I\,T} \lambda =
\begin{bmatrix} R_1^{I\,T} \\ \vdots \\ R_{N_f}^{I\,T} \end{bmatrix} \lambda
= R^T f =
\begin{bmatrix} R_1^T & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & R_{N_f}^T \end{bmatrix}
\begin{bmatrix} f_1 \\ \vdots \\ f_{N_f} \end{bmatrix}
$$

Only those columns of $R_j^{I\,T}$ which operate on Lagrange multipliers that are
associated with $\Gamma_I \cap \Omega_j$ are non-zero. The projection matrix is
$P = [I - R^I (R^{I\,T} R^I)^{-1} R^{I\,T}]$, where $R^{I\,T} R^I$ is generally banded of dimension at
most equal to $6 N_f$. The banded structure of $P$ is determined by the subdomain
interconnectivity. If for practical reasons this banded structure is not exploited,
the number of three-dimensional floating subdomains should be kept as small as
possible, say less than thirty-two, which implies that the proposed computational
method would be suitable only for coarse and medium grain multiprocessors.


5. Preconditioning the interface problem 


As in the case of the conjugate gradient method, the conjugate projected
gradient algorithm is most effective when applied to the preconditioned system of
equations. It should be noted that even in the presence of floating subdomains,
only $F_I$ needs to be preconditioned and not the global Lagrangian matrix $L$. In
the case of two subdomains, $F_I$ can be written in matrix form as:

$$
F_I = [\, B_1 \;\; B_2 \,]
\begin{bmatrix} K_1 & O \\ O & K_2 \end{bmatrix}^{-1}
\begin{bmatrix} B_1^T \\ B_2^T \end{bmatrix}
\qquad (21)
$$

where $K_j^{-1}$, $j = 1, 2$, is replaced by $K_j^+$ if $\Omega_j$ is a floating subdomain. The
objective is to find an approximate inverse $P_I^{-1}$ of $F_I$ that: (a) does not need to be
explicitly assembled (especially since $F_I$ is not explicitly assembled), and (b)
is amenable to parallel computations. The matrix $P_I$ is then the preconditioner.
Equation (21) above suggests the following choice for $P_I^{-1}$:


$$
P_I^{-1} = [\, B_1 \;\; B_2 \,]
\begin{bmatrix} K_1 & O \\ O & K_2 \end{bmatrix}
\begin{bmatrix} B_1^T \\ B_2^T \end{bmatrix}
\qquad (22)
$$

At each iteration $k$, the preconditioned conjugate projected gradient algorithm
involves the solution of an auxiliary system of the form:

$$ P_I z^{(k)} = r^{(k)} \qquad (23) $$

where $r^{(k)}$ is the residual at the $k$-th iteration. The particular choice of $P_I^{-1}$
given in (22) offers the advantage of solving (23) explicitly without the need for
any intermediate factorization.

For computational efficiency, $P_I^{-1}$ is implemented as:

$$ P_I^{-1} = K_1^I + K_2^I \qquad (24) $$

where $K_1^I$ and $K_2^I$ are the traces of $K_1$ and $K_2$ on $\Gamma_I$. Clearly, with this choice
for the preconditioner, the auxiliary system (23) is “cheap”, easy to solve and
perfectly parallelizable on both local and shared memory parallel architectures.
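The construction in (22)-(24) can be sketched as follows. The subdomain matrices are random symmetric positive definite stand-ins (an assumption for this example, not finite element matrices), so the computed condition numbers say nothing about the shell problem studied below; the sketch only shows that the preconditioner is the sum of the interface blocks and that solving (23) reduces to one matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_I = 10, 3        # unknowns per subdomain, interface unknowns numbered last

def random_spd(size):
    """Synthetic symmetric positive definite matrix (stand-in for K_j)."""
    A = rng.standard_normal((size, size))
    return A @ A.T + size * np.eye(size)

K1, K2 = random_spd(n), random_spd(n)
B = np.hstack([np.zeros((n_I, n - n_I)), np.eye(n_I)])    # B_1 = B_2 = [O  I]

F_I = B @ np.linalg.solve(K1, B.T) + B @ np.linalg.solve(K2, B.T)

P_inv_22 = B @ K1 @ B.T + B @ K2 @ B.T                    # equation (22)
P_inv_24 = K1[-n_I:, -n_I:] + K2[-n_I:, -n_I:]            # equation (24)

r = rng.standard_normal(n_I)
z = P_inv_24 @ r                                          # solves (23), no factorization

cond_F = np.linalg.cond(F_I)
cond_PF = np.linalg.cond(P_inv_24 @ F_I)
```

Comparing `cond_F` with `cond_PF` on actual stiffness matrices is the kind of experiment reported in Table 1.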

Since we do not have a strong mathematical justification for this choice of
the preconditioner, we have conducted a set of numerical experiments to assess
its performance a priori. A fixed-fixed cylindrical panel was discretized with an
N by M regular mesh and was modeled with 4 node shell elements (fig. 4). All
test cases used $N_s = 2$ and a vertical slicing.



FIG. 4 Cylindrical panel - $N_s = 2$

Table 1 shown below reports the condition numbers of the global stiffness matrix
$K$, the subdomain stiffness matrices $K_1$ and $K_2$, and the original and
preconditioned interface flexibility matrices $F_I$ and $P_I^{-1} F_I$, for various values of N.


TABLE 1

Condition numbers

Cylindrical panel - N by M mesh - shell elements - 2 subdomains

N    M    $\kappa(K)$      $\kappa(K_1)$    $\kappa(K_2)$    $\kappa(F_I)$    $\kappa(P_I^{-1} F_I)$
10   5    2.5 x 10^4   5.6 x 10^3   5.6 x 10^3   1.4 x 10^4   4.9 x 10^2
20   10   3.4 x 10^5   2.1 x 10^4   2.1 x 10^4   2.8 x 10^4   3.8 x 10^3
40   20   5.4 x 10^6   9.1 x 10^4   9.1 x 10^4   1.2 x 10^5   3.1 x 10^4


For this test problem, the condition number of the preconditioned interface
is about two orders of magnitude lower than that of the global problem.
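The size of the gap can be read off Table 1 with a short calculation. The sketch below only re-uses the tabulated values; the variable names are ours.

```python
import math

# (N, kappa(K), kappa(P_I^{-1} F_I)) as listed in Table 1
rows = [(10, 2.5e4, 4.9e2), (20, 3.4e5, 3.8e3), (40, 5.4e6, 3.1e4)]

# Number of orders of magnitude between the global and the
# preconditioned interface condition numbers, for each mesh.
gains = {N: math.log10(k_global / k_interface) for N, k_global, k_interface in rows}

for N, gain in gains.items():
    print(f"N = {N:2d}: {gain:.2f} orders of magnitude")
```

The gap is roughly 1.7, 2.0 and 2.2 orders of magnitude for N = 10, 20 and 40, so it widens as the mesh is refined.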

The extrapolation of (22) and (24) to $N_s > 2$ is straightforward. In order to
further reduce the number of preconditioned conjugate projected gradient
iterations, the selective reorthogonalization procedure developed by Roux and
reported in [13] is also utilized.


6. Parallel characteristics of the proposed method 

Like most domain decomposition based algorithms, the proposed method of finite
element tearing and interconnecting is perfectly suitable for parallel processing.
If every subdomain $\Omega_j$ is assigned to an individual processor $p_j$, all local finite
element computations can be performed in parallel. These include forming and
assembling the stiffness matrix $K_j$ and the forcing vector $f_j$, factoring $K_j$ and
eventually computing the rigid modes $R_j$, as well as backsolving for $u_j$ after $\lambda$ and
$\alpha$ have been determined. The conjugate projected gradient algorithm described
in Section 4 is also amenable to parallel processing. For example, the
matrix-vector product $F_I s^{(k)}$ can be computed in parallel by assigning to each processor
$p_j$ the task of evaluating $y_j^{(k)} = B_j K_j^{-1} B_j^T s^{(k)}$, and exchanging $y_j^{(k)}$ with the
processors assigned to neighboring subdomains in order to assemble the global
result. Interprocessor communication is required only during the solution of the
interface problem (14) and takes place exclusively among neighboring processors
during the assembly of the subdomain results.


At this point, we stress that the parallel solution method developed herein
requires inherently less interprocessor communication than other domain
decomposition based parallel algorithms. As mentioned earlier, interprocessor
communication within the proposed method occurs only during the solution of the
interface problem (14). The reader should trace this problem, as well as the
presence of the Lagrange multipliers, back to the integral quantity:

$$ (u_i - u_j, \lambda)_{\Gamma_{I_{ij}}} = \int_{\Gamma_{I_{ij}}} \lambda \, (u_i - u_j) \, d\Gamma \qquad (25) $$

where $\Gamma_{I_{ij}}$ is the interface between subdomains $\Omega_i$ and $\Omega_j$. If $\Gamma_{I_{ij}}$ has a zero
measure, then $(u_i - u_j, \lambda)_{\Gamma_{I_{ij}}} = 0$ and no exchange of information is needed
between $\Omega_i$ and $\Omega_j$. Therefore the subdomains which interconnect along one edge
in three-dimensional problems and those which interconnect along one vertex in
both two and three-dimensional problems do not require any interprocessor
communication. This is unlike the parallel method of substructures, whether the
interface problem is solved with a direct scheme [1] or with an iterative one [2].
For a three-dimensional regular mesh that is partitioned into subcubes, the
proposed method of finite element tearing and interconnecting requires that each
subdomain communicate with at most six neighboring subdomains (since a cube
has only six faces), while the parallel method of substructures necessitates that
each subdomain communicate with up to 26 neighbors (fig. 5). This
communication characteristic makes the proposed parallel solution method very attractive
for a multiprocessor with a distributed memory such as a hypercube. Indeed,
the advantages of the method for this family of parallel processors are twofold:
(a) the number of messages passed is dramatically reduced, which reduces the
overhead due to communication start-up, and (b) the complexity of the
communication requirements is reduced so that an optimal mapping of the processors
onto the subdomains can be reached (Bokhari [14], Farhat [15]); therefore the
elapsed time for a given message is improved. Both enhancements (a) and (b)
reduce the communication overhead of the parallel solution algorithm in a
synergistic manner. This algorithmic feature of the proposed method is still desirable
for shared memory multiprocessors because it eases the assembly process during
the interface solution and makes the latter more manageable. It is not, however,
as critical for the performance as it is for local memory multiprocessors.
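The neighbor counts quoted above follow from simple enumeration. The sketch below (the function name is ours) counts, for an interior box-shaped subdomain of a regular partition, the neighbors that share a full face (an edge in two dimensions) and all neighbors that touch it, including those met only along an edge or at a vertex.

```python
from itertools import product

def neighbor_counts(dim):
    """Return (face-sharing neighbors, all touching neighbors) of an
    interior box subdomain in a regular dim-dimensional partition."""
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]
    # Exactly one non-zero offset component: the neighbor shares a full
    # face in 3D, or a full edge in 2D.
    face_sharing = [o for o in offsets if sum(map(abs, o)) == 1]
    return len(face_sharing), len(offsets)

tearing_3d, substructuring_3d = neighbor_counts(3)
tearing_2d, substructuring_2d = neighbor_counts(2)
print(tearing_3d, substructuring_3d)
print(tearing_2d, substructuring_2d)
```

The first count of each pair corresponds to interfaces of non-zero measure, which are the only ones that carry Lagrange multipliers.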




FIG. 5 Reduced interprocessor communication patterns 
for two and three-dimensional regular mesh partitions 


7. Tearing vs. substructuring 

Another difference between the subdomain based parallel solution method devel- 
oped in this paper and the parallel method of substructures lies in the formulation 
of the interface problem. For the method of substructures, the interface problem 
corresponds to a stiffness formulation. For the two-subdomain decomposition it 
can be written as: 


$$ (K_{II} - K_{1I}^T K_{11}^{-1} K_{1I} - K_{2I}^T K_{22}^{-1} K_{2I}) \, u_I = f_I - K_{1I}^T K_{11}^{-1} f_1 - K_{2I}^T K_{22}^{-1} f_2 \qquad (26) $$

where $K_{II}$, $K_{11}$ and $K_{22}$ are the stiffness matrices associated respectively with
the interface nodes and the interior nodes of subdomains $\Omega_1$ and $\Omega_2$, and $K_{1I}$ and
$K_{2I}$ are the coupling stiffnesses between respectively $\Omega_1$ and $\Gamma_I$ and $\Omega_2$ and $\Gamma_I$
(see, for example [1] for further details). A standard conjugate gradient algorithm
may be used for solving (26). On the other hand, the resulting interface problem
for the method of finite element tearing and interconnecting corresponds to a
flexibility formulation. For the two-subdomain decomposition, it can be written
as in (14) and necessitates the use of a conjugate projected gradient algorithm
for finding the solution $\lambda$.

If $K_1$ and $K_2$ are partitioned into internal and boundary (interface)
components and then are injected into the first of equations (7), it can be easily shown
that:

$$
\begin{aligned}
B_1 K_1^{-1} B_1^T &= (K_{II}^{(1)} - K_{1I}^T K_{11}^{-1} K_{1I})^{-1} \\
B_2 K_2^{-1} B_2^T &= (K_{II}^{(2)} - K_{2I}^T K_{22}^{-1} K_{2I})^{-1}
\end{aligned}
\qquad (27)
$$

where $K_{II}^{(1)}$ and $K_{II}^{(2)}$ denote respectively the contributions of the first and second
subdomains to $K_{II}$. Equations (27) above establish the relationship between
both approaches to domain decomposition.
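Relation (27) is the standard block-inverse identity: the interface block of the inverse of a subdomain stiffness matrix equals the inverse of its Schur complement. A quick numerical check, using a synthetic symmetric positive definite matrix as a stand-in for a subdomain stiffness (an assumption for this example), can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_int = 8, 3                       # total unknowns, interface unknowns (last)
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)           # synthetic SPD stand-in for K_j

B = np.hstack([np.zeros((n_int, n - n_int)), np.eye(n_int)])   # B_j = [O  I]

K_jj = K[:n - n_int, :n - n_int]      # interior block
K_jI = K[:n - n_int, n - n_int:]      # interior-interface coupling
K_II = K[n - n_int:, n - n_int:]      # interface block (subdomain contribution)

flexibility = B @ np.linalg.solve(K, B.T)                  # B_j K_j^{-1} B_j^T
schur = K_II - K_jI.T @ np.linalg.solve(K_jj, K_jI)        # condensed stiffness
flexibility_from_schur = np.linalg.inv(schur)
```

The two arrays `flexibility` and `flexibility_from_schur` agree to rounding error, which is the sense in which the flexibility formulation (14) and the stiffness formulation (26) are built from inverse subdomain operators.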

The computational implications of the differences between the two solution 
methods are as follows: 

• within each iteration, the solution process of problem (14) requires an
additional computational step which corresponds to the projection of the
search direction onto the null space of $R_2^{I\,T}$.

• within each iteration, the solution process of problem (14) requires the
evaluation of the matrix-vector product $B_j K_j^{-1} B_j^T s^{(k)}$, while the solution
process of problem (26) requires the evaluation of the matrix-vector product
$K_{jI}^T K_{jj}^{-1} K_{jI} s^{(k)}$. Given that $B_j$ is a boolean matrix and that its application
to a matrix or a vector defines a floating-point-free extraction process, each
conjugate gradient iteration applied to (14) is less computationally intensive
than its counterpart that is applied to (26).

• since a conjugate gradient algorithm initially captures the high frequency
mesh modes of a problem, it can be expected to perform better on a flexibility
matrix than on a stiffness matrix, because the high frequencies of the former
are the low frequencies of the stiffness matrix, which are closer to the
solution of the static problem.

In the light of the above remarks, it is reasonable to expect that for a given mesh
partition:

• each conjugate projected gradient iteration that is applied to the solution
of the interface problem (14) which results from the method of finite element
tearing and interconnecting will not be slower, and may be even faster
for large-scale problems and a small number of interface nodes, than each
conjugate gradient iteration applied to the solution of the interface problem
(26) which results from the method of substructures.

• the iterative solution of the interface problem associated with the tearing
method will exhibit a faster rate of convergence than the iterative solution
of the interface problem resulting from the conventional method of
substructures.

Finally, it should be noted that domain decomposition methods in general 
exhibit a larger degree of parallelism than parallel direct solvers. The efficiency of 
the latter is governed by the bandwidth of the given finite element system of equa- 
tions. If the bandwidth is not large enough, interprocessor communication and/or 
process synchronization can dominate the work done in parallel by each proces- 
sor. This is true not only for multiprocessors with a message-passing system, but 
also for super-vector-multiprocessors with a shared memory such as the CRAY 
systems, where synchronization primitives are rather expensive. Therefore, the 
computational method described in this paper should be seriously considered 
for large-scale problems with a relatively small or medium bandwidth. These 
problems are typically encountered in the finite element analysis of large space 
structures which are often elongated and include only a few elements along one 
or two directions (Farhat [16]). The method is also recommended for problems 
where the storage requirements of direct solvers cannot be met. 


8. Optimal mesh decomposition 

The computational method described in this paper requires that the given finite 
element mesh be partitioned into as many submeshes as there are available pro- 
cessors. In this section, we establish some guidelines for the design of an optimal 
mesh partition by analyzing the effect of its structure on the performance of the 
global solution algorithm. 

From the numerical point of view, the proposed solution method is hybrid in
the sense that it combines a direct and an iterative scheme. The direct solver is
applied to each subdomain problem, the iterative one to the interface between these
subdomains. If the mesh partition is such that the bandwidth of each subdomain
problem is of the same order as the bandwidth of the global unpartitioned
system of equations, the overall algorithm performs more operations than a direct
method applied to the global unpartitioned system, independently of how fast
the interface problem converges. The slicing of a parallelepiped along its largest
dimension yields such a partition (fig. 6). If on the other hand the same
parallelepiped is partitioned such that the bandwidth of each subdomain problem is
much smaller than the bandwidth of the original finite element system (fig. 7),
and if the convergence of the interface problem is fast enough, the method of
finite element tearing and interconnecting may produce the solution with fewer
computations than a global direct solver.



(a) the number of interface nodes, and (b) the interconnectivity of the subdomains
along their interface. It can be easily checked that within one iteration of the
conjugate projected gradient algorithm, new information that is issued from a
subdomain $\Omega_j$ reaches only those subdomains that interconnect with $\Omega_j$ along
an edge or a plane. Therefore, the interface problem converges faster for a mesh
partition that is characterized by a larger effective interconnectivity bandwidth.

The above observations suggest that an automatic finite element mesh
decomposer that is suitable for the computational method described herein should
meet or strike a balanced compromise between the seven following requirements:

1. it should yield a set of subdomains where the bandwidth of each local problem
is only a fraction of the bandwidth of the global system of equations;

2. it should keep the number of interface nodes as small as possible in order to
reduce the size of the interface problem;

3. it should yield a set of subdomains with a relatively high interconnectivity 
bandwidth so that within each iteration a new correction reaches as many 
subdomains as possible; 

4. it should avoid producing subdomains with “bad” aspect ratio (for example, 
elongated and flat subdomains) in order to keep the local problems as well- 
conditioned as possible; 

5. it should deliver as few as possible floating subdomains in order to keep the 
cost associated with the projected gradients as low as possible; 

6. it should yield a set of balanced subdomains in order to ensure that the 
overall computational load will be as evenly distributed as possible among 
the processors; 

7. it should be able to handle irregular geometry and arbitrary discretization 
in order to be general purpose. 

For some mesh topologies, it becomes very difficult to meet simultaneously 
requirements (1), (2) and (4). In that case, priority should be given to the 
first two requirements. However, we have found that for many problems, the 
above requirements can be met, using for example a slightly modified version 
of the general purpose finite element decomposer presented by Farhat in [17]. 
Several decomposition examples are described in Section 9. The most challenging 
problem that is yet to be resolved is the rational relationship between the mesh 
decomposition and the interface conditioning.


9. Applications and performance assessment


We first illustrate the proposed parallel computational method with the static
analysis of a three-dimensional mechanical joint subjected to internal pressure
loading, carried out on a 32 processor iPSC/2 hypercube. We report performance
results which show that the parallel method of tearing exhibits a better speed-up
than the parallel method of conventional substructuring because it consumes
three times less interprocessor communication. Next, we apply our algorithm to
the large-scale finite element analysis, on a 4 processor CRAY-2, of a
three-dimensional cantilever composite beam made of more than one hundred stiff
carbon fibers bound by a nearly incompressible elastomer matrix. We report and
discuss in detail the measured performance results for various mesh partitioning
strategies. For that problem, the proposed solution method outperforms the direct
Cholesky factorization by a factor greater than three, even for configurations that
yield very ill-conditioned systems. In the following, NP, NE, NDF, $T_{msg}$, $T_p$ and
SP denote respectively the number of processors, the number of elements, the
number of degrees of freedom, the time elapsed in message-passing, the total
parallel time and the overall parallel speed-up.


The finite element discretization of the mechanical joint using 8 node brick 
elements is shown in figure 8. Two meshes are considered. The first one con- 
tains 5002 elements, 14932 degrees of freedom and is intended for a 16 processor 
cluster of the iPSC/2. The second mesh has 9912 elements, 29654 degrees of free- 
dom and is constructed for a 32 processor configuration of the same hypercube. 
The mesh decompositions into 16 and 32 subdomains are carefully designed to
be as topologically equivalent as possible to a checkerboard partitioning.
Consequently, many of the resulting subdomains are floating.



FIG. 8 Finite element discretization of a mechanical joint 

The interprocessor communication time per iteration, the total parallel exe- 
cution time, and the overall parallel speed-up associated with the parallel method 
of tearing and the parallel method of substructures are reported in table 2 for 
both meshes. For all cases, a tolerance of $10^{-3}$ on the global relative residuals is
selected as a convergence criterion.


TABLE 2

Performance results on iPSC/2

Mechanical joint - brick elements - 16 and 32 subdomains

NP   NE     NDF     T_msg/itr.   T_msg/itr.   T_p      T_p       SP      SP
                    subs.        tearing      subs.    tearing   subs.   tearing
16   5002   14932   16.3 ms      5.2 ms       602 s    546 s     14.4    15.4
32   9912   29654   17.9 ms      5.4 ms       1103 s   917 s     24.0    28.8

For both cases, the parallel tearing and parallel substructuring algorithms achieve
excellent speed-up. This is generally true for all balanced algorithms that require
message-passing only between neighboring processors. However, for this problem,
the tearing algorithm is faster and, on 32 processors, exhibits a 20 % higher
speed-up than the conventional substructuring algorithm, for which the time elapsed in
interprocessor communication is 3.31 times higher. Again, because it avoids
interprocessor communication along the edges and corners of the subdomains, the
tearing algorithm requires fewer message-passing startups which, in the case of
short messages, are known to account for the largest portion of the time elapsed in
interprocessor communication on the iPSC/2 (see, for example, the benchmarks of
Boman and Rose [18]). A performance comparison with a parallel direct solver is
not provided because of the lack of memory space to store in-core the triangular
factors of K.
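The ratios quoted in this paragraph can be recomputed from the entries of Table 2. The short sketch below only re-uses the tabulated values; the dictionary layout and variable names are ours.

```python
# NP -> (T_msg per iteration in ms, T_p in s, speed-up), taken from Table 2
substructuring = {16: (16.3, 602.0, 14.4), 32: (17.9, 1103.0, 24.0)}
tearing = {16: (5.2, 546.0, 15.4), 32: (5.4, 917.0, 28.8)}

for NP in (16, 32):
    msg_s, tp_s, sp_s = substructuring[NP]
    msg_t, tp_t, sp_t = tearing[NP]
    print(f"NP = {NP}: message time ratio {msg_s / msg_t:.2f}, "
          f"speed-up ratio {sp_t / sp_s:.2f}, "
          f"efficiency {sp_s / NP:.2f} (subs.) vs {sp_t / NP:.2f} (tearing)")

comm_ratio_32 = substructuring[32][0] / tearing[32][0]
speedup_gain_32 = tearing[32][2] / substructuring[32][2]
```

The 3.31 communication ratio and the 20 % speed-up gain both correspond to the 32 processor row; on 16 processors the same ratios are smaller.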


Now that the parallel properties of the presented algorithm have been illus- 
trated, we focus next on example problems that illustrate its intrinsic properties 
and performance. We consider the large-scale finite element static analysis of the 
pure bending of a set of beams made of similar jointed composite “pencils” (fig. 
9). Each composite pencil contains one carbon fiber with its elastomer matrix
and is discretized in 51 vertical layers, each containing 25 mesh points. The cross
section of the finite element mesh corresponding to a 16 pencil beam is shown in 
figure 10. 




FIG. 9 A composite beam and a composite pencil 




FIG. 10 Cross section of the finite element mesh 
for a 16 pencil composite beam 


The numerical results obtained on a 4 processor CRAY-2 for a 16 pencil beam
with 48000 degrees of freedom are reported in figures 11 and 12. These correspond
to two extreme mesh decompositions, namely: (a) a horizontal cross-slicing into 4
subdomains each containing 4 cantilever parallel pencils (D1), and (b) a vertical
slicing into 4 subdomains of which three are floating (D2). Poisson’s ratio for the
elastomer is 0.49.


FIG. 11 Numerical results for decomposition D1
(panel title: preconditioned tearing method)

FIG. 12 Numerical results for decomposition D2
(panel title: preconditioned tearing method; horizontal axis: number of iterations)

For each decomposition case, three curves are reported which correspond
to monitoring convergence with three different measures: (A) the global force
relative residual, (B) the displacement relative variation, and (C) the interface
relative residual. Clearly, decomposition D1 induces a faster convergence rate
than decomposition D2. This result was expected since within each itera-
tion of the iterative solution of the interface problem, information reaches all of
the subdomains in decomposition D1, while it reaches only half of these in de-
composition D2. Another important result relates to the relative positioning of
the three curves, independently from the decomposition pattern. Note first that 
convergence with the global force relative residual is harder to achieve than con- 
vergence with the displacement relative variation. This is because the problem 
suffers from a severe ill-conditioning due to the incompressibility of the elastomer 
(Poisson ratio = 0.49) and the elongated shape of the cantilever composite beam. 
Note also that convergence with the interface relative residual is closer to con- 
vergence with the displacement relative variation than it is to convergence with 
the global force relative residual. This is because the interface problem is formu- 
lated in the functional space of the stresses, so that its residuals correspond to a 
displacement increment. 

Finally, the tearing method is compared for performance with a direct 
Cholesky factorization. The same bending problem is selected for that purpose. 
Several different mesh configurations which correspond to different numbers of 
pencils are considered. Performance results on a CRAY-2 single processor are re-
ported in Table 3. A tolerance of $10^{-6}$ on the global relative residuals is selected
as a convergence criterion.

TABLE 3

Performance results on CRAY-2
Composite beam - brick elements - direct vs. 4-subdomain tearing

                                        Number of pencils
                                     4          9         16
----------------------------------------------------------------
NDF                              13000      28000      48000
----------------------------------------------------------------
4-subdomain tearing method
  NDF interface                   2450       7350      14700
  # iterations                     130        210        300
  CPU time (s)                      20         73        193
  Memory size (m.w.)               1.6        4.5        9.5
----------------------------------------------------------------
Global Cholesky factorization
  CPU time (s)                      15        130        650
  Memory size (m.w.)               3.6         16         50


The above results demonstrate that for sufficiently large problems, the tear- 
ing method can outperform direct solvers. For the particular problem above it 
runs up to 3.3 times faster than Cholesky factorization and requires 5.2 times less
memory space.
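
The quoted ratios can be checked directly from the entries of Table 3. The short script below is only an arithmetic check; the variable names are ours, and the numbers are transcribed from the table.

```python
# Values transcribed from Table 3 (CRAY-2, single processor).
pencils = [4, 9, 16]
tearing_cpu = [20.0, 73.0, 193.0]      # seconds
cholesky_cpu = [15.0, 130.0, 650.0]    # seconds
tearing_mem = [1.6, 4.5, 9.5]          # m.w.
cholesky_mem = [3.6, 16.0, 50.0]       # m.w.


def ratios(direct, tearing):
    """Ratio direct/tearing; values above 1 favor the tearing method."""
    return [d / t for d, t in zip(direct, tearing)]


cpu_ratio = ratios(cholesky_cpu, tearing_cpu)
mem_ratio = ratios(cholesky_mem, tearing_mem)

for n, c, m in zip(pencils, cpu_ratio, mem_ratio):
    print(f"{n:2d} pencils: CPU ratio {c:.2f}, memory ratio {m:.2f}")
```

For the smallest mesh the CPU ratio is below one, which is consistent with the statement that the advantage appears only for sufficiently large problems.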


10. Closure and overview of subsequent research 

A novel domain decomposition approach for the parallel finite element solution 
of equilibrium equations is presented. The spatial domain is partitioned into a 
set of totally disconnected subdomains, each assigned to an individual proces- 
sor. Lagrange multipliers are introduced to enforce compatibility at the interface 
nodes. In the static case, each floating subdomain induces a local singularity that 
is resolved in two phases. First, the rigid body modes are eliminated in parallel 
from each local problem and a direct scheme is applied concurrently to all sub- 
domains in order to recover each partial local solution. Next, the contributions 
of these modes are related to the Lagrange multipliers through an orthogonality
condition. A parallel conjugate projected gradient algorithm is developed for the
solution of the coupled system of local rigid modes components and Lagrange
multipliers, which completes the solution of the problem. When implemented 
on local memory multiprocessors, this proposed method of tearing and intercon- 
necting requires less interprocessor communications than the classical method of 
substructuring. It is also suitable for parallel/vector computers with shared mem-
ory. Large-scale example applications are reported on the iPSC/2 and CRAY-2.
Measured performance results illustrate the advantages of the proposed method 
and demonstrate its potential to outperform the classical method of substructures 
and parallel direct solvers. 

It is our experience that domain decomposition methods are very sensitive 
to the mesh partition. In this paper, we have outlined some guidelines for the 
practical decomposition of a given finite element mesh. Subsequent research will 
focus on determining the relationship between a pattern of decomposition and the 
resulting conditioning of each of the local problems and the interface one. While 
several preconditioners for conventional domain decomposition methods (Schur 
methods) are available in the literature, further research is needed to develop a
preconditioner for hybrid domain decomposition algorithms such as the tearing
method developed herein.


Appendix A. Solving a consistent singular system $K_j u_j = f_j$

For completeness, we include in this appendix a derivation of the solution of a
consistent singular system of equations. In this work, such a system arises in
every floating subdomain $\Omega_j$ and takes the form:

$$K_j u_j = f_j \qquad (28)$$

where $K_j$ is the $(n_j^s + n_j^I) \times (n_j^s + n_j^I)$ stiffness matrix associated
with $\Omega_j$, and $u_j$ and $f_j$ are the corresponding displacement and forcing
vectors. If $\Omega_j$ has $n_j^r$ rigid body modes, $K_j$ is rank $n_j^r$ deficient.
Provided that $f_j$ is orthogonal to the null space of $K_j$, the singular system (28)
is consistent and admits a general solution of the form:

$$u_j = K_j^+ f_j + R_j \alpha \qquad (29)$$

where $K_j^+$ is a pseudo-inverse of $K_j$ (that is, $K_j^+$ verifies $K_j K_j^+ K_j = K_j$),
$R_j$ is a basis of the null space of $K_j$ (that is, $R_j$ stores the $n_j^r$ rigid body
modes of $\Omega_j$), and $\alpha$ is a vector of length $n_j^r$ containing arbitrary real
coefficients.

A.1 Computing the rigid body modes

Let the superscripts p and r denote respectively a principal and a redundant
quantity. The singular stiffness matrix $K_j$ is partitioned as:

$$K_j = \begin{bmatrix} K_j^{pp} & K_j^{pr} \\ K_j^{pr\,T} & K_j^{rr} \end{bmatrix} \qquad (30)$$

where $K_j^{pp}$ has full rank equal to $n_j^s + n_j^I - n_j^r$. If $R_j$ is defined as:

$$R_j = \begin{bmatrix} -K_j^{pp\,-1} K_j^{pr} \\ I_{n_j^r} \end{bmatrix} \qquad (31)$$

where $I_{n_j^r}$ is the $n_j^r \times n_j^r$ identity matrix, then $R_j$ satisfies:

$$K_j R_j = 0$$

Moreover, $I_{n_j^r}$ has full column rank and so does $R_j$. Therefore, the $n_j^r$ columns
of $R_j$ as defined in (31) form a basis of the null space of $K_j$.

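A.2 Computing $K_j^+ f_j$

The partitioning of the singular matrix $K_j$ defined in (30) implies that:

$$K_j^{rr} = K_j^{pr\,T} K_j^{pp\,-1} K_j^{pr} \qquad (32)$$

Using the above identity, it can be easily checked that the matrix $K_j^+$ defined as:

$$K_j^+ = \begin{bmatrix} K_j^{pp\,-1} & O \\ O & O \end{bmatrix}$$

is a pseudo-inverse of $K_j$. Therefore, a solution of the form $K_j^+ f_j$ can also be
written as:

$$u_j = K_j^+ f_j = \begin{bmatrix} K_j^{pp\,-1} f_j^p \\ 0 \end{bmatrix}$$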
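
The identities (30)-(32) can be checked numerically on a small example. The sketch below assumes a one-dimensional chain of unit springs with no supports (one rigid body mode) and takes the last equation as the redundant one; the helper name `free_chain_stiffness` and the choice of principal and redundant sets are ours, not part of the original derivation.

```python
import numpy as np


def free_chain_stiffness(n):
    """Stiffness of n nodes joined by n-1 unit springs, with no supports."""
    K = np.zeros((n, n))
    for e in range(n - 1):
        K[e:e + 2, e:e + 2] += np.array([[1.0, -1.0], [-1.0, 1.0]])
    return K


n = 5
K = free_chain_stiffness(n)
p = np.arange(n - 1)          # principal equations (assumed: all but the last)
r = np.array([n - 1])         # redundant equation (one rigid body mode)

K_pp = K[np.ix_(p, p)]
K_pr = K[np.ix_(p, r)]

# Null-space basis, equation (31): R = [-K_pp^{-1} K_pr ; I]
R = np.vstack([-np.linalg.solve(K_pp, K_pr), np.eye(len(r))])

# Pseudo-inverse of section A.2: K_pp^{-1} in the principal block, zeros elsewhere
K_plus = np.zeros_like(K)
K_plus[np.ix_(p, p)] = np.linalg.inv(K_pp)

# A self-equilibrated load is orthogonal to the rigid body mode
f = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
u = K_plus @ f                # particular solution K^+ f of equation (29)
```

Any vector `u + R @ alpha` then solves the same singular system, which is the general solution (29).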


In practice, $K_j$ cannot be explicitly re-arranged as in (30). Rather, the
following should be implemented when $K_j$ is stored in skyline form. A zero
pivot that is encountered during the factorization process of $K_j$ corresponds to a
redundant equation which needs to be labeled and removed from the system. The
zero pivot is set to one, the reduced column above it is copied into an extra right
hand side (this corresponds to a forward reduction with $K_j^{pr} u_j^r$ as right hand
side), and the coefficients in the skyline corresponding to that pivotal equation
are set to zero. At the end of the factorization process, the non-labeled equations
define the full rank matrix $K_j^{pp}$. The backward substitution is modified to operate
also on the $n_j^r$ extra right hand sides in order to recover
$u_j^p = -K_j^{pp\,-1} K_j^{pr} u_j^r$.

The above procedure for solving a consistent singular system of equations 
has almost the same computational complexity as the solution of a non-singular 
one. 
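
A minimal dense sketch of the zero-pivot treatment described above is given below. It is not the skyline implementation: it uses a dense Cholesky-like factorization, a relative pivot tolerance `tol` that is our assumption, and it recovers only the particular solution (the extra right hand sides that yield the rigid body modes are omitted). The function names are hypothetical.

```python
import numpy as np


def factor_with_zero_pivots(K, tol=1e-10):
    """Dense Cholesky-like factorization that labels zero pivots as redundant."""
    A = np.array(K, dtype=float)
    n = A.shape[0]
    scale = np.max(np.abs(np.diag(A)))
    L = np.zeros((n, n))
    redundant = []
    for i in range(n):
        if A[i, i] <= tol * scale:
            # Zero pivot: label the equation, set the pivot to one and
            # zero the coefficients of that pivotal equation.
            redundant.append(i)
            L[i, :i] = 0.0
            L[i, i] = 1.0
            continue
        L[i, i] = np.sqrt(A[i, i])
        L[i + 1:, i] = A[i + 1:, i] / L[i, i]
        A[i + 1:, i + 1:] -= np.outer(L[i + 1:, i], L[i + 1:, i])
    return L, redundant


def solve_consistent(K, f, tol=1e-10):
    """Particular solution of a consistent singular system K u = f."""
    L, redundant = factor_with_zero_pivots(K, tol)
    rhs = np.array(f, dtype=float)
    rhs[redundant] = 0.0              # redundant unknowns are set to zero
    y = np.linalg.solve(L, rhs)       # forward reduction
    return np.linalg.solve(L.T, y), redundant   # backward substitution


# Free-free chain of four nodes and three unit springs (one rigid body mode)
K = np.array([[1.0, -1.0, 0.0, 0.0],
              [-1.0, 2.0, -1.0, 0.0],
              [0.0, -1.0, 2.0, -1.0],
              [0.0, 0.0, -1.0, 1.0]])
f = np.array([1.0, 0.0, 0.0, -1.0])   # self-equilibrated load
u, redundant = solve_consistent(K, f)
```

The cost is that of one ordinary factorization plus the bookkeeping for the labeled equations, in line with the remark on computational complexity above.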


Appendix B. Starting Lagrange multiplier vector 

In this appendix we present a fast scheme for generating a starting vector $\lambda^{(0)}$
for the conjugate projected gradient algorithm (19-20). We consider the general
case of an arbitrary mesh partition.

For each floating subdomain $\Omega_j$, the corresponding component of the starting
vector has to satisfy the equality constraint:

$$R_j^{I\,T} \lambda_j^{(0)} = R_j^T f_j \qquad (33)$$

where $R_j$ is an $(n_j^s + n_j^I) \times n_j^r$ full column rank matrix which stores the rigid
body modes of the floating subdomain $\Omega_j$, $R_j^I$ is the restriction of $R_j$ to the
intersection of $\Omega_j$ and the interface $\Gamma_I$, and $f_j$ is the vector of prescribed forces
in $\Omega_j$. If $\lambda_j^{(0)}$ is written as:

$$\lambda_j^{(0)} = R_j^I \mu_j^{(0)} \qquad (34)$$

then (33) becomes:

$$(R_j^{I\,T} R_j^I) \, \mu_j^{(0)} = R_j^T f_j \qquad (35)$$

which admits as solution:

$$\mu_j^{(0)} = (R_j^{I\,T} R_j^I)^{-1} R_j^T f_j \qquad (36)$$

Therefore, a starting vector $\lambda_j^{(0)}$ which satisfies the constraint equation (33) is
given by:

$$\lambda_j^{(0)} = R_j^I (R_j^{I\,T} R_j^I)^{-1} R_j^T f_j \qquad (37)$$

The matrix product $(R_j^{I\,T} R_j^I)$ is only $n_j^r \times n_j^r$, where $n_j^r$ is the number of rigid
body modes of the floating subdomain $\Omega_j$. Therefore, $(R_j^{I\,T} R_j^I)$ is at most 3 x 3
in two-dimensional problems and at most 6 x 6 in three-dimensional problems,
and the evaluation of $\lambda_j^{(0)}$ according to (37) requires little computational effort.
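
As an illustration of (33)-(37), the sketch below builds a starting vector for one floating subdomain and checks the constraint. The matrix `R` is a random stand-in for the rigid body modes, the interface unknowns are assumed to be numbered last, and all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_I, n_r = 10, 4, 3                  # subdomain size, interface size, number of modes
R = rng.standard_normal((n, n_r))       # stand-in for the rigid body modes R_j
interface = np.arange(n - n_I, n)       # interface unknowns numbered last (assumption)
R_I = R[interface, :]                   # restriction of R_j to the interface
f = rng.standard_normal(n)              # prescribed forces f_j

# Equation (36): small n_r x n_r system
mu0 = np.linalg.solve(R_I.T @ R_I, R.T @ f)

# Equation (37): starting vector on the interface of this subdomain
lam0 = R_I @ mu0
```

Only an `n_r x n_r` system is solved, which is why the cost is negligible compared with a subdomain factorization.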

Abstract. A domain decomposition algorithm based on a hybrid variational
principle was proposed in reference [1] for the parallel finite element solution of 
self-adjoint elliptic partial differential equations. First, the spatial domain was 
partitioned into a set of totally disconnected subdomains and an incomplete finite 
element solution was computed in each of these subdomains. Next, a number of 
Lagrange multipliers equal to the number of degrees of freedom located at the 
binding interface were introduced to enforce compatibility constraints between 
the independent local finite element approximations. For structural and mechan- 
ical problems, the resulting algorithm was shown to outperform the conventional 
method of substructures, especially on parallel processors. Here, the use of a much 
lower number of Lagrange multipliers for interconnecting the incomplete field fi- 
nite element solutions is investigated. When accuracy is preserved, this approach 
reduces drastically the computational complexity of the Schur-complement-like 
coupling system that is associated with the interface region and enhances signifi- 
cantly the overall performance of the methodology. Finite element procedures for 
both global and piecewise polynomial approximations of the Lagrange multipli- 
ers are derived. Finally, some numerical results obtained for structural example 
problems that validate the main idea and highlight its advantages are presented. 

1. Introduction.


Recently, Farhat and Roux [1] have presented a parallel finite element computa-
tional method for the solution of static equilibrium problems that is a departure
from the parallel method of substructures (see, for example, Nour-Omid, A. Raef-
sky and G. Lyzenga [2], Farhat and Wilson [3]). The unique feature of the
proposed procedure is that it requires less interprocessor communication than
traditional domain decomposition algorithms, while it still offers the same amount
of parallelism. The computational strategy was denoted by “finite element tearing
and interconnecting” because of its resemblance to the very early work of Kron
[4] on tearing methods for electric circuit models. Basically, the finite element
mesh is “torn” into a set of totally disconnected submeshes and a computational
strategy is derived from a hybrid variational principle where the inter-subdomain
continuity constraint is removed via the introduction of a Lagrange multiplier
function.

In reference [1], the authors have interconnected the subdomain incomplete 
finite element solutions with a number of discrete Lagrange multipliers that is 
equal to the number of degrees of freedom that are lying on the binding interface. 
That allowed them to recover exactly the same finite element solution as with 
non-hybrid variational principles. Here, we consider the use of a substantially 
lower number of discrete Lagrange multipliers, which would further enhance the 
serial and parallel performance of the proposed computational algorithm when an 
adequate accuracy is preserved. The fundamental idea is not essentially different 
from the one presented in the mathematical work of Dorr [5]. In order to motivate 
this approach, we first re-derive in Section 2 the basic method of tearing and 
interconnecting and summarize its major computational advantages. In Sections 
3 and 4, we develop polynomial and piecewise low order polynomial expressions for 
the finite element discretization of the interface Lagrange multiplier function and 
describe their computer implementation. We consider both cases of continuum 
and lattice structures. In Section 5, we present an iterative refinement procedure 
for improving the accuracy of the resulting algorithm and in Section 6 we report on 
some numerical results obtained for two-subdomain problems and problems where 
the meshes are decomposed with one-way separators only. These preliminary 
results indicate that a very high accuracy is achieved with a very low number of 
discrete Lagrange multipliers. We also highlight the computational advantages 
of the proposed parallel algorithm with the large-scale static analysis of the Solid 
Rocket Booster (SRB) on the CRAY Y-MP; for that problem, the parallel skyline 
and banded solvers are outperformed. 

2. A method of finite element tearing and interconnecting


Here we summarize a domain decomposition based algorithm associated with a 
hybrid formulation for the parallel finite element solution of the linear elastostatic 
problem (Farhat and Roux [1]). The method is equally applicable to the finite 
element solution of any self-adjoint elliptic partial differential equation. For the 
sake of clarity, we consider only the case of two subdomains. The generalization 
for an arbitrary number of subdomains is fully developed in [1]. 


The variational form of the three-dimensional elastostatic boundary-value
problem goes as follows. Given g and h, find the displacement function u which
is a stationary point of the energy functional:

$$J(v) = \frac{1}{2} a(v,v) - (v,g) - (v,h)_\Gamma$$

where

$$a(v,w) = \int_\Omega v_{(i,j)} \, c_{ijkl} \, w_{(k,l)} \, d\Omega$$
$$(v,g) = \int_\Omega v_i \, g_i \, d\Omega$$
$$(v,h)_\Gamma = \int_{\Gamma_h} v_i \, h_i \, d\Gamma \qquad (1)$$

In the above, the indices i, j, k take the value 1 to 3, $v_{(i,j)} = (v_{i,j} + v_{j,i})/2$ and
$v_{i,j}$ denotes the partial derivative of the i-th component of v with respect to
the j-th spatial variable, $c_{ijkl}$ are the elastic coefficients, $\Omega$ denotes the volume
of the elastostatic body, $\Gamma$ its piecewise smooth boundary, and $\Gamma_h$ the piece of $\Gamma$
where the tractions $h_i$ are prescribed.


If $\Omega$ is torn into two regions $\Omega_1$ and $\Omega_2$ (Fig. 1), solving the above elastostatic
problem is equivalent to finding the two displacement functions $u_1$ and $u_2$ which
are stationary points of the energy functionals:

$$J_1(v_1) = \frac{1}{2} a(v_1,v_1)_{\Omega_1} - (v_1,g)_{\Omega_1} - (v_1,h)_{\Gamma_1}$$
$$J_2(v_2) = \frac{1}{2} a(v_2,v_2)_{\Omega_2} - (v_2,g)_{\Omega_2} - (v_2,h)_{\Gamma_2}$$

where

$$a(v_1,w_1)_{\Omega_1} = \int_{\Omega_1} v_{1(i,j)} \, c_{ijkl} \, w_{1(k,l)} \, d\Omega$$
$$a(v_2,w_2)_{\Omega_2} = \int_{\Omega_2} v_{2(i,j)} \, c_{ijkl} \, w_{2(k,l)} \, d\Omega$$
$$(v_1,g)_{\Omega_1} = \int_{\Omega_1} v_{1i} \, g_i \, d\Omega$$
$$(v_2,g)_{\Omega_2} = \int_{\Omega_2} v_{2i} \, g_i \, d\Omega$$
$$(v_1,h)_{\Gamma_1} = \int_{\Gamma_{h_1}} v_{1i} \, h_i \, d\Gamma$$
$$(v_2,h)_{\Gamma_2} = \int_{\Gamma_{h_2}} v_{2i} \, h_i \, d\Gamma \qquad (2)$$

and which satisfy on the interface boundary $\Gamma_I$ the continuity constraint:

$$u_1 = u_2 \quad \text{on } \Gamma_I \qquad (3)$$

FIG. 1 Tearing in two subdomains

The two above variational problems (2) with the subsidiary continuity condi-
tion (3) can be cast into a single hybrid variational principle (see, for example,
Pian [6], Zienkiewicz and Taylor [7] and references cited therein) which corre-
sponds to finding the saddle point of the total potential energy:

$$J^*(v_1, v_2, \lambda) = J_1(v_1) + J_2(v_2) - \int_{\Gamma_I} \lambda \, (v_1 - v_2) \, d\Gamma \qquad (4)$$

If now the displacement fields $u_1$ and $u_2$ are expressed by suitable shape
functions as:

$$u_1 = N u_1 \quad \text{and} \quad u_2 = N u_2 \qquad (5)$$

where, on the right-hand sides, $u_1$ and $u_2$ denote the vectors of nodal displacements,
and the continuity equation is enforced for the discrete problem (that is, if a
discrete Lagrange multiplier is introduced at each i-th degree of freedom of
the discrete interface boundary $\Gamma_I$), a standard Galerkin procedure transforms the
hybrid variational principle (4) into the following algebraic system:

$$K_1 u_1 = f_1 + B_1^T \lambda$$
$$K_2 u_2 = f_2 - B_2^T \lambda$$
$$B_1 u_1 = B_2 u_2 \qquad (6)$$

where $K_j$, $u_j$, and $f_j$, j = 1, 2, are respectively the stiffness matrix, the displace-
ment vector, and the prescribed force vector associated with the finite element
discretization of $\Omega_j$. The vector of Lagrange multipliers $\lambda$ represents the interac-
tion forces between the two subdomains $\Omega_1$ and $\Omega_2$ along their common boundary
$\Gamma_I$. It introduces in the above system of equations the quantities $K_1^{-1} B_1^T \lambda$ and
$K_2^{-1} B_2^T \lambda$ which implicitly correct the incomplete finite element solutions $K_1^{-1} f_1$
and $K_2^{-1} f_2$.

Within each subdomain $\Omega_j$, we denote the number of interior nodal unknowns
by $n_j^s$ and the number of interface nodal unknowns by $n_j^I$. The total number
of interface nodal unknowns is denoted by $n_I$. Note that $n_I = n_1^I = n_2^I$ in
the particular case of two subdomains. If the interior degrees of freedom are
numbered first and the interface ones are numbered last, each of the two boolean
connectivity matrices $B_1$ and $B_2$ takes the form:

$$B_j = [\, O_j \;\; I_j \,] \qquad j = 1, 2 \qquad (7)$$

where $O_j$ is an $n_j^I \times n_j^s$ null matrix and $I_j$ is the $n_j^I \times n_j^I$ identity matrix. The
vector of Lagrange multipliers $\lambda$ is $n_I$ long.


The stiffness matrices $K_1$ and $K_2$ are non singular if and only if each of
the defined subdomains has enough prescribed boundary conditions to eliminate
its rigid body modes. However a typical mesh decomposition often produces a
certain number of floating subdomains. If in the above example $\Omega_2$ is a floating
subdomain, equations (6) can be re-arranged after some algebraic manipulations
(see [1]) as:

$$\begin{bmatrix} F_I & -R_2^I \\ -R_2^{I\,T} & O \end{bmatrix}
\begin{bmatrix} \lambda \\ \alpha \end{bmatrix} =
\begin{bmatrix} B_2 K_2^+ f_2 - B_1 K_1^{-1} f_1 \\ -R_2^T f_2 \end{bmatrix}$$
$$u_1 = K_1^{-1} (f_1 + B_1^T \lambda)$$
$$u_2 = K_2^+ (f_2 - B_2^T \lambda) + R_2 \alpha \qquad (8)$$

where $F_I = B_1 K_1^{-1} B_1^T + B_2 K_2^+ B_2^T$, $K_2^+$ is a pseudo-inverse of $K_2$, $R_2$ is an
$(n_2^s + n_2^I) \times n_2^r$ rectangular matrix whose columns represent the $n_2^r$ rigid body
modes of $\Omega_2$, $R_2^I = B_2 R_2$ is the restriction of $R_2$ to the interface, and $\alpha$ specifies
a linear combination of these modes. For three-dimensional problems $n_2^r \le 6$,
and for two-dimensional problems $n_2^r \le 3$. Clearly, the Lagrangian matrix
is indefinite. However, $F_I$ is symmetric positive definite and $R_2^I$ has full column
rank. Therefore, the system of equations in $(\lambda, \alpha)$ is symmetric and non-singular.
It admits a unique solution $(\lambda, \alpha)$ which uniquely determines $u_1$ and $u_2$.

It is important to note that since $n_2^r \le 6$, the Lagrangian system (8) and $F_I$
have almost the same size. For an arbitrary number of subdomains N, of which
$N_f$ are floating, the additional number of equations introduced by the handling
of local singularities is bounded by $6 N_f$. For large-scale problems and relatively
coarse mesh partitions, this number, which determines the size of $\alpha$, is a very
small fraction of the size of the global system. On the other hand, if a given
tearing process does not result in any floating subdomain, $\alpha$ vanishes and the
corresponding Lagrangian and $F_I$ systems become identical.
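
To make the structure of system (8) concrete, the sketch below applies it to a one-dimensional bar of unit springs torn into a clamped subdomain and a floating one, and compares the result with the undecomposed solution. The bar, its loading, the assignment of the interface load to the first subdomain, and all names are assumptions made for illustration; the pseudo-inverse is the partitioned one of Appendix A with the last equation taken as redundant.

```python
import numpy as np


def chain(n_nodes, fixed_first=False):
    """Stiffness of n_nodes joined by unit springs; optionally clamp node 0."""
    K = np.zeros((n_nodes, n_nodes))
    for e in range(n_nodes - 1):
        K[e:e + 2, e:e + 2] += np.array([[1.0, -1.0], [-1.0, 1.0]])
    return K[1:, 1:] if fixed_first else K


# Reference: bar with nodes 0..8, node 0 clamped, unit load on nodes 1..8
u_ref = np.linalg.solve(chain(9, fixed_first=True), np.ones(8))

# Subdomain 1: nodes 0..4 (clamped). Subdomain 2: nodes 4..8 (floating).
K1, K2 = chain(5, fixed_first=True), chain(5)
f1 = np.ones(4)                               # interface load assigned to subdomain 1
f2 = np.array([0.0, 1.0, 1.0, 1.0, 1.0])
B1 = np.array([[0.0, 0.0, 0.0, 1.0]])         # selects node 4 in subdomain 1
B2 = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]])    # selects node 4 in subdomain 2

# Pseudo-inverse and rigid body mode of the floating subdomain
K2_plus = np.zeros_like(K2)
K2_plus[:4, :4] = np.linalg.inv(K2[:4, :4])
R2 = np.ones((5, 1))
R2_I = B2 @ R2

# Lagrangian system (8)
F_I = B1 @ np.linalg.solve(K1, B1.T) + B2 @ K2_plus @ B2.T
A = np.block([[F_I, -R2_I], [-R2_I.T, np.zeros((1, 1))]])
rhs = np.concatenate([B2 @ K2_plus @ f2 - B1 @ np.linalg.solve(K1, f1),
                      -R2.T @ f2])
lam, alpha = np.split(np.linalg.solve(A, rhs), [1])

# Subdomain solutions
u1 = np.linalg.solve(K1, f1 + B1.T @ lam)
u2 = K2_plus @ (f2 - B2.T @ lam) + R2 @ alpha
```

Here the Lagrangian system is 2 x 2: one interface multiplier plus one rigid body coefficient, which illustrates that handling the floating subdomain adds only $n_2^r$ equations.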


In reference [1], a set of guidelines for carrying out the practical decom-
position of an arbitrary mesh, as well as a parallel computational scheme for
solving equations (8) in the presence of an arbitrary number of subdomains were
presented. The proposed computational scheme featured a parallel precondi-
tioned conjugate projected gradient algorithm for the solution of the indefinite
Lagrangian system. It was also shown that the proposed method of finite element 
tearing and interconnecting compares favorably with the conventional method of 
substructures and with direct solvers on both serial and parallel computers. It 
is particularly attractive for local memory multiprocessors such as hypercubes 
because it intrinsically requires much less interprocessor communication than the 
parallel method of substructures [2]. This is because the need for interprocessor 
communication in this formulation is exclusively induced by the weak form of the 
continuity constraint: 
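
$$(v_i - v_j, \lambda)_{\Gamma_{I_{ij}}} = \int_{\Gamma_{I_{ij}}} \lambda \, (v_i - v_j) \, d\Gamma \qquad (9)$$

and because if $\Gamma_{I_{ij}}$ has a zero measure, then $(v_i - v_j, \lambda)_{\Gamma_{I_{ij}}} = 0$ and no exchange of
information is needed between subdomains $\Omega_i$ and $\Omega_j$. Therefore the subdomains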
which interconnect along one edge in three-dimensional problems and those which 
interconnect along one vertex in both two and three-dimensional problems do not 
require any interprocessor communication. This is unlike the parallel method of 
substructures and other conventional domain decomposition algorithms. 

The efficiency of the tearing method outlined above depends on how fast the
Schur complement or interface system represented here by $F_I$ can be solved. This
is often the case for many of the subdomain based implicit/explicit parallel solu-
tion algorithms. The nature of $F_I = B_1 K_1^{-1} B_1^T + B_2 K_2^+ B_2^T$ makes any technique
that requires this matrix explicitly unsuitable for solving the interface system.
This implies that a direct method or an iterative method of the SOR
type cannot be used. The only efficient method for solving this system is that of
conjugate gradients, because once $K_1$ and $K_2$ have been factorized, matrix-vector
products of the form $F_I v$ can be performed very efficiently using only forward
and backward substitutions. Therefore convergence rate becomes the key factor
for enhancing the overall efficiency of the procedure. In reference [1], the authors
have considered careful mesh partitioning schemes and a suitable preconditioner
for improving this convergence rate. Here we investigate an approach for speeding
up the solution of the interface system which consists of reducing drastically its
size. When this can be achieved (without hurting accuracy) to an extent where
$F_I$ can be explicitly formed, assembled and stored, a direct solution method is
applied to the Schur complement equations so that the convergence rate is no
longer an issue. Otherwise, the same semi-iterative algorithm as presented
in [1] is used for solving the new interface system that is characterized by a much
smaller size than in our previous work.
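
The matrix-free product $F_I v$ can be sketched as follows. The example assumes two non-floating subdomains, so that both stiffness matrices are positive definite and a plain (unprojected, unpreconditioned) conjugate gradient loop applies; the stiffness matrices are random stand-ins, and dense triangular solves via `np.linalg.solve` replace the sparse forward and backward substitutions of an actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_spd(n):
    """Random symmetric positive definite stand-in for a subdomain stiffness."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)


n, n_I = 12, 4
K1, K2 = random_spd(n), random_spd(n)
B = np.hstack([np.zeros((n_I, n - n_I)), np.eye(n_I)])   # form (7), same for both

# Factorize each subdomain matrix once
L1, L2 = np.linalg.cholesky(K1), np.linalg.cholesky(K2)


def solve_factored(L, b):
    """Forward then backward substitution with the factor L of K = L L^T."""
    return np.linalg.solve(L.T, np.linalg.solve(L, b))


def apply_F(v):
    """F_I v without ever forming F_I."""
    return B @ solve_factored(L1, B.T @ v) + B @ solve_factored(L2, B.T @ v)


def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=100):
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * np.linalg.norm(b):
            break
        Ap = apply_A(p)
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


d = rng.standard_normal(n_I)              # stand-in interface right hand side
lam = conjugate_gradient(apply_F, d)

# Explicit F_I, formed here only to check the matrix-free result
F_explicit = B @ np.linalg.inv(K1) @ B.T + B @ np.linalg.inv(K2) @ B.T
```

Each iteration costs one pair of substitutions per subdomain, which is why the number of iterations governs the total cost.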

In this paper, we concentrate on the two-subdomain problem which high- 
lights the main idea and does not require a substantial amount of coding. The 
obtained results are so encouraging (Section 6) that we have started developing 
the necessary software for handling arbitrary mesh decompositions with multiple 
subdomains. This effort will be reported in a forthcoming paper. 

Next, we discretize the Lagrange multiplier function that binds the subdo- 
main incomplete solutions using a polynomial approximation and derive the finite 
element representation of the new interface system. 


3. Approximating the Lagrange multipliers with polynomials 

The weak form of the equations of static equilibrium associated with the hybrid
variational principle formulated in equation (4) is obtained using the standard
virtual work principle. It is expressed as:

$$\int_{\Omega_1} \delta u_1^T L^T D L u_1 \, d\Omega - \int_{\Gamma_I} \delta u_1^T \lambda \, d\Gamma - \int_{\Omega_1} \delta u_1^T g \, d\Omega - \int_{\Gamma_1} \delta u_1^T h \, d\Gamma = 0$$
$$\int_{\Omega_2} \delta u_2^T L^T D L u_2 \, d\Omega + \int_{\Gamma_I} \delta u_2^T \lambda \, d\Gamma - \int_{\Omega_2} \delta u_2^T g \, d\Omega - \int_{\Gamma_2} \delta u_2^T h \, d\Gamma = 0$$
$$\int_{\Gamma_I} \delta \lambda^T (u_1 - u_2) \, d\Gamma = 0 \qquad (10)$$

where the vectors g and h have been defined in equations (2), the vectors $u_1$ and
$u_2$ in equations (5), and D and L are the matrix representations of, respectively,
a constitutive equation and a spatial derivative operator.


If the Lagrange multiplier function $\lambda$ is degree-of-freedom collocated along
the interface, that is, if a discrete Lagrange multiplier scalar $\lambda_i$ is attached at
each degree of freedom lying on the interface boundary $\Gamma_I$, the above equations
are transformed into the algebraic equations (6), where the vector of Lagrange
multipliers $\lambda$ is $n_I$ long. As a result, the interface system of equations (8) is $n_I \times n_I$
large. In order to reduce the size of this system, we consider first a polynomial
approximation for $\lambda$. For this purpose, we assume that the finite element problem
of interest has d degrees of freedom per node and that the interface $\Gamma_I$ between
$\Omega_1$ and $\Omega_2$ is parametrized by a curvilinear abscissa s (Fig. 2).


FIG. 2 Parametrization of a two-subdomain interface

We define d polynomials of degree p as:

$$\lambda^1(s) = \sum_{k=0}^{p} \lambda_k^1 s^k$$
$$\lambda^2(s) = \sum_{k=0}^{p} \lambda_k^2 s^k$$
$$\vdots$$
$$\lambda^d(s) = \sum_{k=0}^{p} \lambda_k^d s^k \qquad (11)$$

where p is much smaller than $n_I$ and $\{\lambda_k^1, \lambda_k^2, ..., \lambda_k^d, \; k = 0, 1, ..., p\}$ are (p+1)d un-
known discrete Lagrange multipliers. Physically, these still represent the interface
tractions that are necessary to maintain equilibrium between the two subdomains
$\Omega_1$ and $\Omega_2$. The superscript j, j = 1, 2, ..., d denotes the directional freedom (x, y,
or z displacement/rotation) of the corresponding traction component. However,
unlike in our previous work, these multipliers are not specified at any location
of the discrete interface $\Gamma_I$. In particular, they are not necessarily attached to
any particular node. Substituting (11) into (10) after re-arranging the third of
equations (10) results in the algebraic system:

$$K_1 u_1 = f_1 + B_1^{p\,T} \lambda^p$$
$$K_2 u_2 = f_2 - B_2^{p\,T} \lambda^p$$
$$B_1^p u_1 = B_2^p u_2 \qquad (12)$$

where $\lambda^p$ is now the (p+1)d long vector:

$$\lambda^p = [\, \lambda_0^1 \; \lambda_0^2 \; ... \; \lambda_0^d \; ... \; \lambda_p^1 \; \lambda_p^2 \; ... \; \lambda_p^d \,]^T \qquad (13)$$
and $B_1^p$ and $B_2^p$ are now non-boolean finite element matrices of sizes $(p+1)d \times (n_1^s + n_1^I)$
and $(p+1)d \times (n_2^s + n_2^I)$ that are assembled from their element level
correspondents $B_1^{p(e)}$ and $B_2^{p(e)}$ in the usual manner:

$$B_j^p = \sum_e B_j^{p(e)} \qquad j = 1, 2 \qquad (14)$$

where e spans only the set of elements that are connected to the interface bound-
ary $\Gamma_I$. For a finite element e with q nodes lying on $\Gamma_I$, the $qd \times (p+1)d$ element
level matrices $B_j^{p(e)}$, j = 1, 2 are given by:

$$B_j^{p(e)} = \begin{bmatrix} {}^1 B_j^{p(e)} \\ {}^2 B_j^{p(e)} \\ \vdots \\ {}^q B_j^{p(e)} \end{bmatrix} \qquad (15)$$

where ${}^l B_j^{p(e)}$, l = 1, 2, ..., q is a $d \times (p+1)d$ matrix associated with the l-th node
of element e and has the following form:

$${}^l B_j^{p(e)} = [\, {}^l_0 B_j^{p(e)} \;\; {}^l_1 B_j^{p(e)} \;\; {}^l_2 B_j^{p(e)} \;\; ... \;\; {}^l_p B_j^{p(e)} \,] \qquad (16)$$

and ${}^l_k B_j^{p(e)}$, k = 0, ..., p is a $d \times d$ diagonal matrix associated with the k-th
monomial $s^k$ and is expressed as:

$${}^l_k B_j^{p(e)} = \left( \int_{\Gamma_I^{(e)}} N_l \, s^k \, d\Gamma \right) I_d \qquad (17)$$

where $N_l$ is the shape function associated with the l-th node of element e and
$I_d$ is the $d \times d$ identity matrix. As an example, for elements that have two and
only two nodes lying on $\Gamma_I$ (q = 2) and for the case of linear shape functions $N_l$,
the submatrices ${}^l_k B_j^{p(e)}$, l = 1, 2, $k \le p$, are given by:

$${}^1_k B_j^{p(e)} = \frac{1}{s_2 - s_1} \left( \frac{s_1^{k+2} - s_2^{k+2}}{k+2} + \frac{s_2^{k+1} - s_1^{k+1}}{k+1} \, s_2 \right) I_d$$
$${}^2_k B_j^{p(e)} = \frac{1}{s_2 - s_1} \left( \frac{s_2^{k+2} - s_1^{k+2}}{k+2} + \frac{s_1^{k+1} - s_2^{k+1}}{k+1} \, s_1 \right) I_d \qquad (18)$$

where $s_1 < s_2$, and $s_1$ and $s_2$ are the curvilinear abscissae of the two nodes of
element e that lie on $\Gamma_I$ (Fig. 2).
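
The closed forms (18) follow from integrating the linear shape functions $N_1 = (s_2 - s)/(s_2 - s_1)$ and $N_2 = (s - s_1)/(s_2 - s_1)$ against $s^k$ as in (17). The sketch below checks them against Gauss-Legendre quadrature and assembles one element-level matrix of the form (15)-(16) with a Kronecker product; the function names and sample values are hypothetical.

```python
import numpy as np


def coeff_node1(s1, s2, k):
    """Closed form of the integral of N_1 * s^k over [s1, s2], first line of (18)."""
    return ((s1**(k + 2) - s2**(k + 2)) / (k + 2)
            + s2 * (s2**(k + 1) - s1**(k + 1)) / (k + 1)) / (s2 - s1)


def coeff_node2(s1, s2, k):
    """Closed form of the integral of N_2 * s^k over [s1, s2], second line of (18)."""
    return ((s2**(k + 2) - s1**(k + 2)) / (k + 2)
            + s1 * (s1**(k + 1) - s2**(k + 1)) / (k + 1)) / (s2 - s1)


def coeff_quadrature(s1, s2, k, node, n_gauss=8):
    """Same integral evaluated by Gauss-Legendre quadrature, equation (17)."""
    x, w = np.polynomial.legendre.leggauss(n_gauss)
    s = 0.5 * (s2 - s1) * x + 0.5 * (s2 + s1)
    N = (s2 - s) / (s2 - s1) if node == 1 else (s - s1) / (s2 - s1)
    return 0.5 * (s2 - s1) * np.sum(w * N * s**k)


def element_matrix(s1, s2, p, d):
    """qd x (p+1)d element matrix of (15)-(16) for q = 2 interface nodes."""
    coeffs = np.array([[coeff_node1(s1, s2, k) for k in range(p + 1)],
                       [coeff_node2(s1, s2, k) for k in range(p + 1)]])
    return np.kron(coeffs, np.eye(d))


Be = element_matrix(0.3, 0.7, p=3, d=2)
```

Raising the degree p only appends columns to `Be`; the columns already computed are unchanged.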

At this point, it is worthwhile to point out that increasing the degree of the
polynomial approximation of $\lambda^k$, k = 1, 2, ..., d involves only adding a few columns
to the existing element level matrices ${}^l B_j^{p(e)}$, as suggested by equation (16).

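For two-subdomain tearings, the constraint matrices $B_j^p$ have the following
pattern:

$$B_j^p = [\, O_j \;\; B_j^{pI} \,] \qquad j = 1, 2 \qquad (19)$$

where $O_j$ is a $(p+1)d \times n_j^s$ null matrix and $B_j^{pI}$ is the $(p+1)d \times n_j^I$ sparse matrix:

$$B_j^{pI} = \begin{bmatrix} \mathcal{D}_{1,1} & \cdots & \mathcal{D}_{1,r} \\ \vdots & & \vdots \\ \mathcal{D}_{p+1,1} & \cdots & \mathcal{D}_{p+1,r} \end{bmatrix} \qquad (20)$$

where $r = n_j^I/d$ and $\mathcal{D}_{ij}$ is a $d \times d$ diagonal matrix.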


Equations (12) above can be re-arranged as:

$$\begin{bmatrix} F_I^p & -B_2^p R_2 \\ -(B_2^p R_2)^T & O \end{bmatrix}
\begin{bmatrix} \lambda^p \\ \alpha \end{bmatrix} =
\begin{bmatrix} B_2^p K_2^+ f_2 - B_1^p K_1^{-1} f_1 \\ -R_2^T f_2 \end{bmatrix}$$
$$u_1 = K_1^{-1} (f_1 + B_1^{p\,T} \lambda^p)$$
$$u_2 = K_2^+ (f_2 - B_2^{p\,T} \lambda^p) + R_2 \alpha \qquad (21)$$

where all variables have the same physical meaning as previously. However, the
resulting interface matrix

$$F_I^p = B_1^p K_1^{-1} B_1^{p\,T} + B_2^p K_2^+ B_2^{p\,T} \qquad (22)$$

is now only $(p+1)d \times (p+1)d$. Since $\lambda^p$ does not enforce the continuity constraint
equation (3) at each of the nodes of the discrete interface $\Gamma_I$, the finite element
field approximations $u_1$ and $u_2$ given by the solution of system (21) are in general
discontinuous along $\Gamma_I$. In order to uniquely define the finite element solution
along the interface boundary, we average the two computed solutions to obtain:

$$u^* = u|_{\Gamma_I} = \frac{1}{2} (u_1 + u_2)|_{\Gamma_I} \qquad (23)$$


We postulate that the above averaged interface solution $u^*$ is more accu-
rate than each of the restrictions of the subdomain solutions $u_1|_{\Gamma_I}$ and $u_2|_{\Gamma_I}$.
Therefore, we back-propagate to the interior of the subdomains $\Omega_1$ and $\Omega_2$ the
enhancing effect of the averaging procedure (23) by imposing $u = u^*$ on $\Gamma_I$ and
solving two independent displacement-driven subdomain problems. For this pur-
pose, we first partition the stiffness matrix of each subdomain as:

$$K_j = \begin{bmatrix} K_{j_{ss}} & K_{j_{sI}} \\ K_{j_{sI}}^T & K_{j_{II}} \end{bmatrix} \qquad j = 1, 2 \qquad (24)$$

where the subscripts ss, II and sI refer respectively to interior, interface and
interior/interface coupling quantities. For any set of given boundary conditions
and any mesh decomposition pattern, the resulting $K_{j_{ss}}$ stiffness matrix is non-
singular. Next, the improved finite element subdomain solutions are computed
as:

$$u_{j_s} = K_{j_{ss}}^{-1} (f_{j_s} - K_{j_{sI}} u^*) \qquad j = 1, 2 \qquad (25)$$

where $u_{j_s}$ and $f_{j_s}$ denote the interior parts of $u_j$ and $f_j$.
It should be noted that the above improvement of the subdomain solutions $u_1$ and
$u_2$ is perfectly parallelizable and requires only one sparse matrix-vector multiply
and one pair of sparse forward/backward substitutions per subdomain. The tri-
angular factors of $K_{j_{ss}}$ are embedded in those of $K_j$ which have been previously
computed.
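
The averaging (23) and the displacement-driven solve built on the partition (24) can be sketched as follows. The example reuses a one-dimensional bar of unit springs as a stand-in for a floating subdomain, uses dense solves instead of the embedded sparse factors, and perturbs the interface value symmetrically so that the average happens to be exact; all names and data are hypothetical.

```python
import numpy as np


def chain(n_nodes, fixed_first=False):
    """Stiffness of n_nodes joined by unit springs; optionally clamp node 0."""
    K = np.zeros((n_nodes, n_nodes))
    for e in range(n_nodes - 1):
        K[e:e + 2, e:e + 2] += np.array([[1.0, -1.0], [-1.0, 1.0]])
    return K[1:, 1:] if fixed_first else K


def back_propagate(K, f, interior, interface, u_star):
    """Impose u = u_star on the interface and solve for the interior unknowns."""
    K_ss = K[np.ix_(interior, interior)]
    K_sI = K[np.ix_(interior, interface)]
    u = np.empty(K.shape[0])
    u[interface] = u_star
    u[interior] = np.linalg.solve(K_ss, f[interior] - K_sI @ u_star)
    return u


# Reference solution of the undecomposed bar (nodes 0..8, node 0 clamped)
u_ref = np.linalg.solve(chain(9, fixed_first=True), np.ones(8))

# Floating subdomain: nodes 4..8, interface node listed first in this sketch
K2 = chain(5)
f2 = np.array([0.0, 1.0, 1.0, 1.0, 1.0])
interface, interior = np.array([0]), np.array([1, 2, 3, 4])

# Two discontinuous interface values, averaged as in (23)
u1_on_interface = u_ref[3] + 0.1
u2_on_interface = u_ref[3] - 0.1
u_star = np.array([0.5 * (u1_on_interface + u2_on_interface)])

u2_improved = back_propagate(K2, f2, interior, interface, u_star)
```

Although `K2` itself is singular, its interior block is not, which is the property stated after (24).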

Usually, the stresses that develop in a structure are more important to the 
analyst than the displacements it undergoes. However, the above improvement 
procedure is such that if u* is a highly accurate approximation of the interface 
solution, Uj, j = 1,2 become highly accurate approximations of the subdomain 
solutions and therefore it is not necessary to monitor the stress fields. 

The solution approach presented here requires a parametrization of the inter- 
face boundary T/. For a given finite element model and a given mesh decompo- 
sition, the interface boundary T j is always well defined for continuum problems. 
Therefore, its parametrization is straightforward, especially for two-subdomain 
problems. However, lattice structures require a special treatment. For the lat- 
ter problems, if T/ is constrained to follow the path defined by the structural 
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members that connect the nodes that axe shared by two lattice subdomains, T / 
will not be identical on both sides of the interface (Fig. 3a-3c). Therefore for 
lattice structures we select T/ as the “geometrical path” that (a) is the simplest 
to parametrize, and (b) has the same trace on the lattice subdomains it inter- 
connects. In particular, only the finite element nodes of this interface need to 
intersect with the structure. Figure 3d depicts T / for the structure shown in 
Figure 3a. 

In general, the number of Lagrange multipliers, $N_\lambda$, that is needed to
achieve a certain accuracy is problem dependent. If this number is rather small,
say less than a hundred, then it is feasible to form $\mathbf{F}_I$ explicitly and
solve the system of equations (21) using a direct method. Otherwise, the
semi-iterative solution algorithm developed in reference [1] is recommended.
However, beyond a certain value of $N_\lambda$, the polynomial approach developed
in this Section becomes numerically problematic. Indeed, approximating the
Lagrange multiplier functions with higher order polynomials of degree
$p = N_\lambda/d - 1$ typically results in very ill-conditioned matrices
$\mathbf{B}_j \mathbf{K}_j^{-1} \mathbf{B}_j^T$, which may cause the performance
and/or accuracy of the proposed computational method to deteriorate. Next, in
Section 4, we develop piecewise low order polynomial approximations for the finite
element discretization of the Lagrange multiplier functions (11) that are suitable
for the case of a rather large value of $N_\lambda$. We remind the reader that d
denotes the number of degrees of freedom per node; for simplicity, it is assumed
to be constant over the nodes. Therefore, since $N_\lambda$ denotes the total number
of discrete Lagrange multipliers, $N_\lambda/d$ represents the number of locations
where surface tractions, or discrete Lagrange multipliers, are to be introduced.
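The loss of conditioning with increasing polynomial degree can be illustrated
with a small numerical experiment. The sketch below does not build the interface
matrices of the method. As a stand-in, it assumes an interface of unit length
parametrized by s in [0, 1] and compares the Gram matrix of the monomials
1, s, ..., s^p with the Gram matrix of piecewise linear hat functions on a uniform
partition carrying the same number of unknowns. The function names and the choice
of hat functions are illustrative assumptions.

```python
import numpy as np

def monomial_gram(n):
    """Gram matrix of 1, s, ..., s**(n-1) on [0, 1]:
    G[i, j] = integral of s**i * s**j ds = 1 / (i + j + 1)."""
    i = np.arange(n)
    return 1.0 / (i[:, None] + i[None, :] + 1.0)

def hat_gram(n):
    """Gram matrix of n piecewise linear hat functions on a uniform
    partition of [0, 1] defined by n points (n >= 2)."""
    h = 1.0 / (n - 1)
    G = np.zeros((n, n))
    local = h / 6.0 * np.array([[2.0, 1.0], [1.0, 2.0]])
    for k in range(n - 1):              # loop over the subintervals
        G[k:k + 2, k:k + 2] += local
    return G

for n in (2, 4, 6, 8):
    print(n, np.linalg.cond(monomial_gram(n)), np.linalg.cond(hat_gram(n)))
```

In this stand-in setting, the condition number of the global polynomial basis
grows by several orders of magnitude as n increases, whereas that of the piecewise
basis remains bounded.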
FIG. 3 Interface boundary definition for lattice structure
(a) truss structure - (b) continuum-like left interface
(c) continuum-like right interface - (d) adopted interface boundary
4. Piecewise low order polynomial approximations

The objective of this section is to develop an alternative procedure for the
finite element discretization of the interface tractions that results in a better
conditioned interface problem than previously when the total number of discrete
Lagrange multipliers that are introduced, $N_\lambda$, is rather large.

Let $\Gamma_I^k$, $k = 0, \ldots, N_\lambda/d - 2$ denote a partition
$\mathcal{P}$ of the interface boundary $\Gamma_I$ defined as:

$$\Gamma_I^k = [s_k, s_{k+1}], \qquad k = 0, \ldots, N_\lambda/d - 2 \tag{26}$$

where $s_k$, $k = 0, \ldots, N_\lambda/d - 1$ are the curvilinear abscissae of
$N_\lambda/d$ specified points on $\Gamma_I$ where the discrete surface tractions
$\lambda_k^j$ are introduced. Within each subinterval $\Gamma_I^k$, we define d
cubic polynomial expressions for the Lagrange multiplier approximations as:

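$$\begin{aligned}
\lambda_k^1(s) &= c_{1k}^1 + c_{2k}^1 (s - s_k) + c_{3k}^1 (s - s_k)^2 + c_{4k}^1 (s - s_k)^3 \\
\lambda_k^2(s) &= c_{1k}^2 + c_{2k}^2 (s - s_k) + c_{3k}^2 (s - s_k)^2 + c_{4k}^2 (s - s_k)^3 \\
&\;\;\vdots \\
\lambda_k^d(s) &= c_{1k}^d + c_{2k}^d (s - s_k) + c_{3k}^d (s - s_k)^2 + c_{4k}^d (s - s_k)^3
\end{aligned} \tag{27}$$

where $c_{ik}^j$, $i = 1, \ldots, 4$, $j = 1, \ldots, d$ are determined by imposing:

$$\begin{aligned}
\lambda_k^j(s_k) &= \lambda_k^j \; ; &
\lambda_k^j(s_{k+1}) &= \lambda_{k+1}^j \\
\frac{d\lambda_k^j}{ds}(s_k) &= \frac{d\lambda^j}{ds}(s_k) \; ; &
\frac{d\lambda_k^j}{ds}(s_{k+1}) &= \frac{d\lambda^j}{ds}(s_{k+1})
\end{aligned}
\qquad
\begin{aligned}
k &= 0, \ldots, N_\lambda/d - 2 \\
j &= 1, \ldots, d
\end{aligned} \tag{28}$$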


The first set of equations (28) implies that
$\lambda_k^j(s_{k+1}) = \lambda_{k+1}^j(s_{k+1})$, so that all $\lambda^j$ are
guaranteed to be continuously approximated on $\Gamma_I$. The second set of
equations (28) involves the derivatives of the Lagrange multiplier functions, which
are neither available nor part of the weak form of the static equations of
equilibrium (10). Following Conte and de Boor [8], we approximate these
derivatives by:
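$$\frac{d\lambda^j}{ds}(s_k) =
\frac{\Delta s_{k-1}}{\Delta s_k \, \Delta_2 s_k} \left(\lambda_{k+1}^j - \lambda_k^j\right)
+ \frac{\Delta s_k}{\Delta s_{k-1} \, \Delta_2 s_k} \left(\lambda_k^j - \lambda_{k-1}^j\right) \tag{29}$$

where $\Delta s_k$ and $\Delta_2 s_k$ are defined as:

$$\begin{aligned}
\Delta s_k &= s_{k+1} - s_k \\
\Delta_2 s_k &= s_{k+1} - s_{k-1}
\end{aligned} \tag{30}$$

Note that (29) requires the two additional points $s_{-1}$ and $s_{N_\lambda/d}$,
which we choose as:

$$\begin{aligned}
s_{-1} &= s_2 \\
s_{N_\lambda/d} &= s_{N_\lambda/d - 3}
\end{aligned} \tag{31}$$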


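Substituting (27) and (29) into (28) determines the constants $c_{ik}^j$ as
functions of the discrete Lagrange multipliers:

$$\begin{aligned}
c_{1k}^j &= \lambda_k^j \\
c_{2k}^j &= \xi_{2k}\lambda_{k+1}^j + \eta_{2k}\lambda_k^j + \zeta_{2k}\lambda_{k-1}^j \\
c_{3k}^j &= \xi_{3k}\lambda_{k+2}^j + \eta_{3k}\lambda_{k+1}^j + \zeta_{3k}\lambda_k^j + \nu_{3k}\lambda_{k-1}^j \\
c_{4k}^j &= \xi_{4k}\lambda_{k+2}^j + \eta_{4k}\lambda_{k+1}^j + \zeta_{4k}\lambda_k^j + \nu_{4k}\lambda_{k-1}^j
\end{aligned} \tag{32}$$

where $\xi_{2k}$, $\xi_{3k}$, $\xi_{4k}$, $\eta_{2k}$, $\eta_{3k}$, $\eta_{4k}$,
$\zeta_{2k}$, $\zeta_{3k}$, $\zeta_{4k}$, $\nu_{3k}$ and $\nu_{4k}$ are constants
that depend only on the curvilinear abscissae $s_{k-1}$, $s_k$, $s_{k+1}$ and
$s_{k+2}$ (see Appendix A).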
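The construction (27)-(32) can be sketched numerically for a single traction
component. The code below computes the slopes (29)-(30) with the end points chosen
as in (31), and then obtains the cubic coefficients of (27) from the four
conditions (28). The function names are hypothetical, and the multiplier values
attached to the two additional abscissae are assumed to be those of the points
they coincide with, which the text does not state explicitly.

```python
import numpy as np

def bessel_slopes(s, lam):
    """Slopes at every point s[k] from (29)-(30), with the extra abscissae
    of (31): s[-1] := s[2] and s[r] := s[r-3], where r = len(s) >= 3.
    Assumption: the values at the extra abscissae are lam[2] and lam[r-3]."""
    s = np.asarray(s, dtype=float)
    lam = np.asarray(lam, dtype=float)
    r = len(s)
    s_ext = np.concatenate(([s[2]], s, [s[r - 3]]))
    lam_ext = np.concatenate(([lam[2]], lam, [lam[r - 3]]))
    slopes = np.empty(r)
    for k in range(r):
        j = k + 1                                # position of s[k] in s_ext
        ds_k = s_ext[j + 1] - s_ext[j]           # Delta s_k
        ds_km1 = s_ext[j] - s_ext[j - 1]         # Delta s_{k-1}
        d2s_k = s_ext[j + 1] - s_ext[j - 1]      # Delta_2 s_k
        slopes[k] = (ds_km1 / (ds_k * d2s_k) * (lam_ext[j + 1] - lam_ext[j])
                     + ds_k / (ds_km1 * d2s_k) * (lam_ext[j] - lam_ext[j - 1]))
    return slopes

def cubic_coefficients(s, lam):
    """Rows [c1, c2, c3, c4] of (27) for each subinterval [s[k], s[k+1]],
    obtained by imposing the value and slope conditions (28)."""
    s = np.asarray(s, dtype=float)
    lam = np.asarray(lam, dtype=float)
    m = bessel_slopes(s, lam)
    h = np.diff(s)
    dlam = np.diff(lam)
    c3 = 3.0 * dlam / h**2 - (2.0 * m[:-1] + m[1:]) / h
    c4 = -2.0 * dlam / h**3 + (m[:-1] + m[1:]) / h**2
    return np.stack([lam[:-1], m[:-1], c3, c4], axis=1)

def evaluate(s, coeffs, x):
    """Evaluate the piecewise cubic at a scalar x in [s[0], s[-1]]."""
    s = np.asarray(s, dtype=float)
    k = int(np.clip(np.searchsorted(s, x, side="right") - 1, 0, len(s) - 2))
    t = x - s[k]
    c1, c2, c3, c4 = coeffs[k]
    return c1 + t * (c2 + t * (c3 + t * c4))
```

With these choices the slopes are exact for quadratic data, so a traction that
varies quadratically along the interface is reproduced without error on a
nonuniform partition.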


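As previously, equations (32) are substituted into equations (27), and (27)
into (10), to obtain:

$$\begin{aligned}
\mathbf{K}_1 \mathbf{u}_1 &= \mathbf{f}_1 + \mathbf{B}_1^{\mathcal{P}\,T} \boldsymbol{\lambda}_{\mathcal{P}} \\
\mathbf{K}_2 \mathbf{u}_2 &= \mathbf{f}_2 - \mathbf{B}_2^{\mathcal{P}\,T} \boldsymbol{\lambda}_{\mathcal{P}} \\
\mathbf{B}_1^{\mathcal{P}} \mathbf{u}_1 &= \mathbf{B}_2^{\mathcal{P}} \mathbf{u}_2
\end{aligned} \tag{33}$$

where $\mathbf{B}_1^{\mathcal{P}}$ and $\mathbf{B}_2^{\mathcal{P}}$ are now
non-boolean finite element matrices of sizes $N_\lambda \times (n_1^s + n_I)$ and
$N_\lambda \times (n_2^s + n_I)$. The subscript/superscript $\mathcal{P}$
emphasizes the dependence of these quantities on the partition $\mathcal{P}$ of
the interface boundary $\Gamma_I$ (26).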
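Both matrices are assembled from their element level counterparts
${}_{k^e}\mathbf{B}_j^{\mathcal{P}(e)}$ in the usual manner:

$$\mathbf{B}_1^{\mathcal{P}} = \sum_e {}_{k^e}\mathbf{B}_1^{\mathcal{P}(e)}
\qquad \text{and} \qquad
\mathbf{B}_2^{\mathcal{P}} = \sum_e {}_{k^e}\mathbf{B}_2^{\mathcal{P}(e)} \tag{34}$$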


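where e spans only the set of elements that are connected to $\Gamma_I$. The left
subscript $k^e$ emphasizes the dependence of
${}_{k^e}\mathbf{B}_j^{\mathcal{P}(e)}$ on the subinterval
$\Gamma_I^{k^e} = [s_{k^e}, s_{k^e+1}]$ where one edge of element e falls. For a
finite element e with q nodes lying on $\Gamma_I$, the $qd \times N_\lambda$
element level matrices ${}_{k^e}\mathbf{B}_j^{\mathcal{P}(e)}$, j = 1,2 are given
by:

$${}_{k^e}\mathbf{B}_j^{\mathcal{P}(e)} =
\begin{bmatrix}
{}_{k^e}\mathbf{B}_{j1}^{\mathcal{P}(e)} \\
{}_{k^e}\mathbf{B}_{j2}^{\mathcal{P}(e)} \\
\vdots \\
{}_{k^e}\mathbf{B}_{jq}^{\mathcal{P}(e)}
\end{bmatrix} \tag{35}$$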


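where ${}_{k^e}\mathbf{B}_{jl}^{\mathcal{P}(e)}$, $l = 1, 2, \ldots, q$ is a
$d \times N_\lambda$ matrix associated with the l-th node of element e and has the
following form:

$${}_{k^e}\mathbf{B}_{jl}^{\mathcal{P}(e)} =
\begin{bmatrix}
\mathbf{O}_l & {}_{k^e-1}\mathbf{b}_{jl}^{(e)} & {}_{k^e}\mathbf{b}_{jl}^{(e)} &
{}_{k^e+1}\mathbf{b}_{jl}^{(e)} & {}_{k^e+2}\mathbf{b}_{jl}^{(e)} & \mathbf{O}_r
\end{bmatrix} \tag{36}$$

where $\mathbf{O}_l$ and $\mathbf{O}_r$ are respectively left and right
$d \times (k^e - 1)d$ and $d \times (N_\lambda - (k^e + 3)d)$ zero matrices, and
each $d \times d$ block ${}_{k}\mathbf{b}_{jl}^{(e)}$ is expressed as:

$${}_{k}\mathbf{b}_{jl}^{(e)} = \beta_k \, \mathbf{I}_d \tag{37}$$

where $\mathbf{I}_d$ is the $d \times d$ identity matrix and $\beta_{k^e-1}$,
$\beta_{k^e}$, $\beta_{k^e+1}$ and $\beta_{k^e+2}$ are functions only of the
partition $\mathcal{P}$ of $\Gamma_I$ and are given by the following integration:

$$\int_{\Gamma_I^{(e)}} \lambda_{k^e}^j(s) \, N_l \, d\Gamma =
\beta_{k^e-1}\lambda_{k^e-1}^j + \beta_{k^e}\lambda_{k^e}^j
+ \beta_{k^e+1}\lambda_{k^e+1}^j + \beta_{k^e+2}\lambda_{k^e+2}^j \tag{38}$$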
It should be noted that while the symbolic derivation of equations (36-38) appears
to be somewhat complicated, their computer implementation is straightforward
and their processing is inexpensive.

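For two-subdomain tearings, the constraint matrices take the following
form:

$$\mathbf{B}_j = \begin{bmatrix} \mathbf{O}_j & \mathbf{B}^p \end{bmatrix},
\qquad j = 1,2 \tag{39}$$

where $\mathbf{O}_j$ is an $N_\lambda \times n_j^s$ null matrix and $\mathbf{B}^p$
is the $N_\lambda \times n_I$ four diagonal sparse matrix:

$$\mathbf{B}^p =
\begin{bmatrix}
\ddots & \ddots & \ddots & \ddots & & \mathbf{O} \\
\mathbf{O} & \mathbf{D}_{k-1,k} & \mathbf{D}_{kk} & \mathbf{D}_{k,k+1} & \mathbf{D}_{k,k+2} & \mathbf{O} \\
 & & \ddots & \ddots & \ddots & \ddots \\
\mathbf{O} & & \cdots & & \mathbf{D}_{r-1,r} & \mathbf{D}_{rr}
\end{bmatrix} \tag{40}$$

where $r = N_\lambda/d$ and each $\mathbf{D}_{kl}$ is a $d \times d$ diagonal matrix.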

After $\mathbf{B}_1^{\mathcal{P}}$ and $\mathbf{B}_2^{\mathcal{P}}$ are set up,
the system of equations (33) is solved as described in Section 3. In particular,
the averaging and correcting procedures outlined in Section 3 are also used.

Approximating $\lambda$ with polynomials (Section 3) does not require the location
of the corresponding physical surface tractions to be specified. On the other hand,
using piecewise low order polynomials for that purpose (Section 4) requires the
definition of a partitioning $\mathcal{P}$ of $\Gamma_I$, which corresponds to
specifying the location of the physical surface tractions along $\Gamma_I$.
Therefore, from a practical viewpoint, the first approach seems more attractive.
However, specifying where a surface traction is to be introduced can be turned
into an advantage if one looks at it as an additional freedom. For example, if the
stress field along $\Gamma_I$ can be predicted qualitatively prior to the analysis,
the partition $\mathcal{P}$ can be refined in the areas of oscillation or high
concentration, and kept coarse otherwise. That would improve the efficiency of the
approximation.





5. An iterative refinement procedure for accuracy improvement 

Here we outline an iterative refinement procedure for improving the accuracy of 
the results when it is required. We discuss both cases of polynomial and piecewise 
low order polynomial approximations, and assume that a reasonable initial value 
is given. We select as convergence criterion: 

$$\begin{aligned}
\left| \, \|\mathbf{u}^{1(m+1)}\|_\infty - \|\mathbf{u}^{1(m)}\|_\infty \right| &< \epsilon \, \|\mathbf{u}^{1(m)}\|_\infty \\
\left| \, \|\mathbf{u}^{2(m+1)}\|_\infty - \|\mathbf{u}^{2(m)}\|_\infty \right| &< \epsilon \, \|\mathbf{u}^{2(m)}\|_\infty \\
&\;\;\vdots \\
\left| \, \|\mathbf{u}^{d(m+1)}\|_\infty - \|\mathbf{u}^{d(m)}\|_\infty \right| &< \epsilon \, \|\mathbf{u}^{d(m)}\|_\infty
\end{aligned} \tag{41}$$

where the superscripts d and m refer respectively to the d-th component of the
solution at each node and to the m-th iteration, and $\epsilon$ is a specified
tolerance. As indicated by equations (41), we independently monitor the convergence
of each of the d components of the displacement field. This prevents potentially
important relative errors in a component of the solution whose magnitude is
relatively small (for example, the x displacement of a cantilever beam with a load
parallel to the y direction) from being masked by an otherwise perfect convergence
of a component whose magnitude is relatively large (for example, the y
displacement).
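A minimal sketch of a component-by-component test in the spirit of (41) is given
below. The function name and the node-by-node storage layout of the solution
vector are assumptions made for illustration. The comparison uses "<=" rather than
"<" so that a component that is identically zero at two successive iterations does
not block convergence.

```python
import numpy as np

def converged(u_new, u_old, d, eps):
    """Check the relative change of the infinity norm of each of the d
    solution components between two successive iterations.

    u_new, u_old: 1-D arrays ordered node by node, d unknowns per node
    (an assumed layout)."""
    u_new = np.asarray(u_new, dtype=float).reshape(-1, d)
    u_old = np.asarray(u_old, dtype=float).reshape(-1, d)
    norm_new = np.abs(u_new).max(axis=0)    # infinity norm, per component
    norm_old = np.abs(u_old).max(axis=0)
    return bool(np.all(np.abs(norm_new - norm_old) <= eps * norm_old))

# Two nodes, d = 2: the y component has converged, the small x component has not.
u_old = [1.0e-6, 1.0, 2.0e-6, 2.0]
u_new = [1.5e-6, 1.0, 3.0e-6, 2.0]
print(converged(u_new, u_old, d=2, eps=1.0e-4))
```

A test based on the norm of the whole vector would accept this iterate, since the
largest entry is unchanged, while the component-wise test rejects it.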


5.1. Polynomial approximation 


Let $N_\lambda^{(m)}$ and $p^{(m)} = N_\lambda^{(m)}/d - 1$ denote respectively the
number of discrete Lagrange multipliers and the degree of the polynomial
approximation of $\lambda$ at iteration m. Suppose that for $N_\lambda^{(m)}$ the
above convergence criterion (41) is not met. A simple iterative refinement
procedure consists in introducing at iteration m + 1 an additional discrete
Lagrange multiplier by considering a polynomial approximation of $\lambda$ of order
$p^{(m+1)} = p^{(m)} + 1$. This entails the generation of the constraint matrices
$\mathbf{B}_j^{(m+1)}$, j = 1,2 and therefore of the element level matrices
$\mathbf{B}_{jl}^{(m+1)(e)}$, $l = 1, \ldots, q$. A careful examination of
equations (15-17) reveals that $\mathbf{B}_{jl}^{(m+1)(e)}$ can be computed very
fast by updating $\mathbf{B}_{jl}^{(m)(e)}$ as follows:

$$\mathbf{B}_{jl}^{(m+1)(e)} =
\begin{bmatrix} \mathbf{B}_{jl}^{(m)(e)} & \beta_l^{\,p^{(m)}+1} \, \mathbf{I}_d \end{bmatrix} \tag{42}$$

where

$$\beta_l^{\,p^{(m)}+1} = \int_{\Gamma_I^{(e)}} N_l \, s^{\,p^{(m)}+1} \, d\Gamma \tag{43}$$

Therefore, if at each iteration m + 1 an additional Lagrange multiplier is
introduced, the re-generation of the constraint matrices requires only the
evaluation for each interface element of the integral
$\int_{\Gamma_I^{(e)}} N_l \, s^{\,p^{(m)}+1} \, d\Gamma$, and the re-generation of
the interface system $\mathbf{F}_I$ requires only the pre/post multiplication of
the subdomain flexibilities with these matrices. Given that
$N_\lambda^{(m)} \ll n_I$, these multiplications are not expensive. In particular,
they are much more economical than those corresponding to a typical conjugate
gradient iteration for the case $N_\lambda = n_I$. The introduction at iteration
m + 1 of more than one discrete Lagrange multiplier is handled in exactly the same
manner.


5.2. Piecewise low order polynomial approximation 

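Let $\Gamma_I^{k(m)}$, $k = 0, \ldots, N_\lambda^{(m)}/d - 2$ denote the partition
$\mathcal{P}^{(m)}$ of the interface boundary $\Gamma_I$ at iteration m:

$$\Gamma_I^{k(m)} = [s_k^{(m)}, s_{k+1}^{(m)}], \qquad
k = 0, \ldots, N_\lambda^{(m)}/d - 2 \tag{44}$$

If at iteration m + 1 an additional discrete Lagrange multiplier is introduced, say
in the subinterval $\Gamma_I^{k^*(m)}$, the resulting partition
$\mathcal{P}^{(m+1)}$ becomes:

$$\Gamma_I^{k(m+1)} = [s_k^{(m+1)}, s_{k+1}^{(m+1)}], \qquad
k = 0, \ldots, N_\lambda^{(m)}/d - 1 \tag{45}$$

where

$$\begin{aligned}
s_k^{(m+1)} &= s_k^{(m)} & k &\le k^* \\
s_k^{(m+1)} &= s_{k-1}^{(m)} & k &> k^* + 1
\end{aligned} \tag{46}$$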


It can be easily shown that the regeneration at iteration m + 1 of the constraint
matrices $\mathbf{B}_1^{\mathcal{P}(m+1)}$ and $\mathbf{B}_2^{\mathcal{P}(m+1)}$
involves basically recomputing the coefficients $c_{ik}^j$, $i = 1, \ldots, 4$,
$j = 1, \ldots, d$ of the polynomial expressions (32), only for those interface
elements which intersect $\Gamma_I$ inside $\Gamma_I^{k^*-1\,(m+1)}$, or
$\Gamma_I^{k^*\,(m+1)}$, or $\Gamma_I^{k^*+1\,(m+1)}$, or
$\Gamma_I^{k^*+2\,(m+1)}$. However, a refinement procedure for this case should
also specify the location where an additional discrete Lagrange multiplier is to
be introduced during the following iteration, that is, it should define $k^*$. In
this work, we choose for that purpose the point of $\Gamma_I$ where
$(\|\mathbf{u}^{(m+1)}\|_\infty - \|\mathbf{u}^{(m)}\|_\infty) / \|\mathbf{u}^{(m)}\|_\infty$
and/or the violation of static equilibrium prior to the averaging and improvement
procedures (23) and (25) are the largest. The introduction at iteration m + 1 of
more than one discrete Lagrange multiplier is handled in exactly the same manner.
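The renumbering (46) amounts to inserting one abscissa into a sorted list. The
sketch below illustrates it; the function name is hypothetical, and the midpoint
default for the new point is only a placeholder, since the text selects the
location from an error or equilibrium indicator.

```python
import numpy as np

def refine_partition(s, k_star, s_new=None):
    """Insert one point inside the subinterval [s[k_star], s[k_star + 1]].

    The returned abscissae satisfy (46): new[k] == s[k] for k <= k_star and
    new[k] == s[k - 1] for k > k_star + 1. The midpoint default is an
    assumption of this sketch, not a prescription of the text."""
    s = np.asarray(s, dtype=float)
    if s_new is None:
        s_new = 0.5 * (s[k_star] + s[k_star + 1])
    if not s[k_star] < s_new < s[k_star + 1]:
        raise ValueError("new point must lie strictly inside the subinterval")
    return np.insert(s, k_star + 1, s_new)

s_m = np.array([0.0, 1.0, 2.0, 4.0])          # partition at iteration m
s_m1 = refine_partition(s_m, k_star=2)        # refine the last subinterval
print(s_m1)
```

Only the subintervals adjacent to the inserted point change, which is why the
coefficients (32) need to be recomputed for a few interface elements only.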


6. Validation and Performance Evaluation 

Ideally, the accuracy of the proposed hybrid method for a given value of
$N_\lambda$ should be assessed by comparing its generated results to the exact
(analytical) solution of the continuum or lattice problem of interest. However,
the latter solution is seldom available. Therefore, we select as reference the
conventional finite element solution of the problem, that is, the solution that is
obtained without the introduction of a hybrid variational principle, and refer to
it as the exact solution.

In this section, we first validate the essence of this paper with simple
two-subdomain structural problems. For each example, we apply the iterative
refinement procedures outlined in Section 5 to generate numerical results
corresponding to various numbers of Lagrange multipliers, $N_\lambda$. We report
only on the computed solutions associated with the interface boundary $\Gamma_I$.
This is because whenever these converge to the reference solution, the improvement
procedure (25) guarantees that the computed subdomain displacement and stress
fields also converge to their reference solution. All examples indicate that a
number of traction forces that is only a small fraction of the size of the
discrete interface boundary $\Gamma_I$ is required to “glue” the incomplete
subdomain solutions. Next, we assess the performance of the developed
computational hybrid algorithm with the large-scale finite element static analysis
of the Solid Rocket Booster (SRB) on a 4 processor CRAY Y-MP; we demonstrate that
for that problem, our algorithm outperforms the fastest of the available parallel
skyline solvers.
6.1 Validation


First, we consider the static analysis of an unsymmetric beam that is clamped at
both ends and subjected to both horizontal and vertical point loads. The
beam is discretized using 4-node plane stress elements (q = 4) with two degrees
of freedom per node (d = 2). The finite element mesh is decomposed into 2
subdomains, each with 108 interior degrees of freedom. For this problem, the size
of the interface problem is $n_I = 18$.



FIG. 4 Two-subdomain decomposition
of an unsymmetric clamped-clamped beam

The interface tractions are approximated successively with polynomials of order
zero, one, two, and three, that is, $N_\lambda$ = 2, 4, 6, and 8. The generated
hybrid solutions are reported in Figures 5-6 for both the horizontal and vertical
displacement fields along the interface boundary $\Gamma_I$. For $N_\lambda = 6$,
both displacement fields are shown to be in excellent agreement with the exact
solution.

FIG. 5 Unsymmetric beam problem: predicted horizontal displacement field

FIG. 6 Unsymmetric beam problem: predicted vertical displacement field

Next, we analyze an unsymmetric planar truss structure (q = 2, d = 2) with
312 degrees of freedom. The unsymmetry is induced by the members' material
properties, which are different on both sides of the axis of geometrical symmetry.
The truss structure is also loaded in both directions as shown in Figure 7. The
lattice mesh is decomposed into two subdomains, each with 144 internal degrees of
freedom. The interface boundary $\Gamma_I$ is depicted in Figure 7.






FIG. 7 Two-subdomain decomposition of an unsymmetric fixed-fixed truss 


The size of the interface problem for the above structure is rather small
($n_I = 24$), so that polynomial approximations for the Lagrange multiplier
functions are considered again. The predicted vertical and horizontal displacements
using the tearing hybrid method are reported in Figures 8-9. Adequate accuracy is
achieved for $N_\lambda = 6$, which corresponds to only 25% of the number of
degrees of freedom along $\Gamma_I$.








Finally, we illustrate the use of piecewise low order polynomials for
the approximation of the interface tractions with the static analysis of a
cantilever beam. A finite element mesh with 300 degrees of freedom is constructed
using 4-node plane stress elements (q = 4) with two degrees of freedom per node
(d = 2). It is partitioned into two non-floating subdomains, each with a minimum
bandwidth (Fig. 10). The horizontal slicing adopted for this problem avoids the
subdomain singularity but produces a larger interface than a vertical slicing. The
size of the interface problem is 60.



FIG. 10 Two-subdomain decomposition of cantilever beam 


An initial partition of $\Gamma_I$ is defined using four points
($N_\lambda = 8$), of which three are clustered towards the free end where the
vertical force is applied. The iterative refinement procedure of Section 5
introduces an additional point in the subinterval that is closest to the load
(Fig. 11).



FIG. 11 Successive partitionings of $\Gamma_I$

Within three iterations, the tearing hybrid algorithm is shown to converge
towards the exact solution (Fig. 12-13). Note, however, that it took only two
iterations for the vertical displacement to converge. This example illustrates the
need for a component-by-component convergence criterion as in (41).
FIG. 12 Cantilever problem: predicted horizontal displacement field

FIG. 13 Cantilever problem: predicted vertical displacement field
(300 global d.o.f., 60 interface d.o.f.; horizontal axis: curvilinear abscissa;
curves: exact, 5-point spline, 4-point spline)


For the above problem, Figure 14 compares the condition numbers of
$\mathbf{B}_1 \mathbf{B}_1^T$ for various values of $N_\lambda$, when the traction
forces are approximated with polynomials and with piecewise low order polynomials.
The advantage of the latter approximation is clearly demonstrated.
6.2 Performance evaluation


Here we report on the performance of the proposed computational algorithm
for a large-scale structural problem. The corresponding parallel/vector code is
implemented on a CRAY Y-MP multiprocessor. Even though this system accommodates
up to 8 processors, only 4 CPUs were available to us.

We consider the solution of the system of equations arising in the linear static
analysis of the SRB when loaded by internal pressure in its Solid Rocket Motor
(SRM) subsystem. The discretized SRB model has 10,453 elements, 9,206 nodes
and 54,870 degrees of freedom (Fig. 15). After node renumbering, the average
profile bandwidth is 310. The finite element mesh is decomposed into 4 subdomains,
each with approximately 2,613 elements. The decomposition is carried out along the
longitudinal direction of the structure, using one-way separators only. This
restriction will be removed in future developments. The optimized average profile
bandwidth for each of the 4 subdomains is 91. Each of the 4 separators includes
approximately 920 degrees of freedom. The size of the interface problem is 3692.
The tolerance $\epsilon$ for the convergence criterion (41) is set to $10^{-4}$.
Given the size of the interface problem, a rather large number of discrete Lagrange
multipliers is anticipated. Therefore, the piecewise low order polynomial
approximation switch is activated and the preconditioned projected conjugate
gradient algorithm described in reference [1] is invoked for the solution of the
interface problem. The hybrid algorithm achieves convergence after 3 iterative
refinement steps with $N_\lambda = 283$. The computed results are compared with
those generated by a parallel/vector skyline solver for validation. Table 1 below
reports the CPU timings for the proposed algorithm and compares them with those of
the fastest solutions that have been published for this problem (Storaasli, Nguyen
and Agarwal [9], Farhat [10]). The proposed algorithm takes roughly half the CPU
time of the skyline solver in both serial and parallel environments.
FIG. 15 Finite element discretization of the SRB


TABLE 1. Equation solving on the CRAY Y-MP
SRB structural model - 54,870 d.o.f.

Number of         CPU time           CPU time
processors        Skyline solver     Tearing Hybrid Algorithm

1                 39 secs            20.18 secs
2                 19.79 secs         10.21 secs
4                 10 secs            5.19 secs
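The timings of Table 1 can be post-processed into speedups and parallel
efficiencies. The short script below does this arithmetic; the variable and
function names are illustrative, and the only data used are the six CPU times
reported in the table.

```python
procs = [1, 2, 4]
skyline = [39.0, 19.79, 10.0]     # CPU seconds, skyline solver (Table 1)
hybrid = [20.18, 10.21, 5.19]     # CPU seconds, tearing hybrid algorithm (Table 1)

def speedup_and_efficiency(times, procs):
    """Return (speedup, efficiency) pairs, with speedup = T(1) / T(p)
    and efficiency = speedup / p."""
    return [(times[0] / t, times[0] / t / p) for t, p in zip(times, procs)]

results = speedup_and_efficiency(hybrid, procs)
for p, t_sky, t_hyb, (sp, eff) in zip(procs, skyline, hybrid, results):
    print(f"{p} CPU: hybrid/skyline time ratio {t_hyb / t_sky:.2f}, "
          f"hybrid speedup {sp:.2f}, efficiency {eff:.2f}")
```

For each processor count the hybrid algorithm needs a little more than half the
CPU time of the skyline solver, and its speedup on 4 processors is close to 3.9.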
7. Conclusion


Recently, Farhat and Roux [1] have developed a domain decomposition algorithm
based on a hybrid variational principle for the parallel finite element solution
of self-adjoint elliptic partial differential equations. First, the spatial domain
was partitioned into a set of totally disconnected subdomains and an incomplete
finite element solution was computed in each of these subdomains. Next, a set
of Lagrange multipliers representing surface tractions was introduced at each
degree of freedom of the discretized binding interface in order to enforce
compatibility constraints between the independent local finite element
approximations. For structural and mechanical problems, the resulting algorithm
was shown to outperform the conventional method of substructures, especially on
parallel processors. In this work, we have investigated the use of a much lower
number of Lagrange multipliers, $N_\lambda$, for interconnecting the incomplete
field finite element solutions. For that purpose, we have derived finite element
procedures for both global and piecewise low order polynomial approximations of
the interface tractions. Through simple structural examples, we have shown that a
high accuracy can be reached with a value of $N_\lambda$ that is only a small
percentage of the total number of interface degrees of freedom. With this
modification, the performance of the hybrid algorithm presented in [1] is
drastically improved since it deals with a much smaller interface, or reduced,
system. Even though we have addressed only the two-subdomain decomposition, the
procedure is readily applicable to many-subdomain problems where only one-way
separators are used for the mesh decomposition. We have illustrated the latter
case with the large-scale static analysis of the Solid Rocket Booster (SRB) on a
4 processor CRAY Y-MP. For that problem, the modified hybrid algorithm is shown to
outperform parallel skyline solvers in both serial and parallel environments.
Future work will focus on the case of arbitrary mesh decompositions and on time
dependent problems.


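Appendix A. Piecewise-cubic Bessel interpolation

Let $\Gamma_I^k$, $k = 0, \ldots, N_\lambda/d - 2$ denote a partition
$\mathcal{P}$ of the interface boundary $\Gamma_I$ defined as:

$$\Gamma_I^k = [s_k, s_{k+1}], \qquad k = 0, \ldots, N_\lambda/d - 2$$

Within each subinterval $\Gamma_I^k$, d cubic polynomials are defined as:

$$\begin{aligned}
\lambda_k^1(s) &= c_{1k}^1 + c_{2k}^1 (s - s_k) + c_{3k}^1 (s - s_k)^2 + c_{4k}^1 (s - s_k)^3 \\
\lambda_k^2(s) &= c_{1k}^2 + c_{2k}^2 (s - s_k) + c_{3k}^2 (s - s_k)^2 + c_{4k}^2 (s - s_k)^3 \\
&\;\;\vdots \\
\lambda_k^d(s) &= c_{1k}^d + c_{2k}^d (s - s_k) + c_{3k}^d (s - s_k)^2 + c_{4k}^d (s - s_k)^3
\end{aligned}$$

The coefficients $c_{ik}^j$, $i = 1, \ldots, 4$, $j = 1, \ldots, d$ are determined
by imposing

$$\begin{aligned}
\lambda_k^j(s_k) &= \lambda_k^j \; ; &
\lambda_k^j(s_{k+1}) &= \lambda_{k+1}^j \\
\frac{d\lambda_k^j}{ds}(s_k) &= \frac{d\lambda^j}{ds}(s_k) \; ; &
\frac{d\lambda_k^j}{ds}(s_{k+1}) &= \frac{d\lambda^j}{ds}(s_{k+1})
\end{aligned}$$

$$\frac{d\lambda^j}{ds}(s_k) =
\frac{\Delta s_{k-1}}{\Delta s_k \, \Delta_2 s_k} \left(\lambda_{k+1}^j - \lambda_k^j\right)
+ \frac{\Delta s_k}{\Delta s_{k-1} \, \Delta_2 s_k} \left(\lambda_k^j - \lambda_{k-1}^j\right),
\qquad
\begin{aligned}
k &= 0, \ldots, N_\lambda/d - 2 \\
j &= 1, \ldots, d
\end{aligned}$$

where $\Delta s_k$ and $\Delta_2 s_k$ are defined as:

$$\begin{aligned}
\Delta s_k &= s_{k+1} - s_k \\
\Delta_2 s_k &= s_{k+1} - s_{k-1}
\end{aligned}$$

The solution of the above equations yields:

$$\begin{aligned}
c_{1k}^j &= \lambda_k^j \\
c_{2k}^j &= \xi_{2k}\lambda_{k+1}^j + \eta_{2k}\lambda_k^j + \zeta_{2k}\lambda_{k-1}^j \\
c_{3k}^j &= \xi_{3k}\lambda_{k+2}^j + \eta_{3k}\lambda_{k+1}^j + \zeta_{3k}\lambda_k^j + \nu_{3k}\lambda_{k-1}^j \\
c_{4k}^j &= \xi_{4k}\lambda_{k+2}^j + \eta_{4k}\lambda_{k+1}^j + \zeta_{4k}\lambda_k^j + \nu_{4k}\lambda_{k-1}^j
\end{aligned}$$

where

$$\begin{aligned}
\xi_{2k} &= \frac{\Delta s_{k-1}}{\Delta s_k \, \Delta_2 s_k} \\
\eta_{2k} &= \frac{\Delta s_k}{\Delta s_{k-1} \, \Delta_2 s_k} - \frac{\Delta s_{k-1}}{\Delta s_k \, \Delta_2 s_k} \\
\zeta_{2k} &= \frac{-\Delta s_k}{\Delta s_{k-1} \, \Delta_2 s_k} \\
\xi_{3k} &= \frac{-1}{\Delta s_{k+1} \, \Delta_2 s_{k+1}} \\
\eta_{3k} &= \frac{3}{\Delta s_k^2} - \frac{\Delta s_{k-1}}{\Delta s_k^2 \, \Delta_2 s_k}
 + \frac{1}{\Delta s_{k+1} \, \Delta_2 s_{k+1}}
 - \frac{\Delta s_{k+1}}{\Delta s_k^2 \, \Delta_2 s_{k+1}}
 - \frac{\Delta s_{k-1}}{\Delta s_k^2 \, \Delta_2 s_k} \\
\zeta_{3k} &= \frac{-3}{\Delta s_k^2} - \frac{2}{\Delta s_{k-1} \, \Delta_2 s_k}
 + \frac{\Delta s_{k-1}}{\Delta s_k^2 \, \Delta_2 s_k}
 + \frac{\Delta s_{k+1}}{\Delta s_k^2 \, \Delta_2 s_{k+1}}
 + \frac{\Delta s_{k-1}}{\Delta s_k^2 \, \Delta_2 s_k} \\
\nu_{3k} &= \frac{2}{\Delta s_{k-1} \, \Delta_2 s_k} \\
\xi_{4k} &= \frac{1}{\Delta s_k \, \Delta s_{k+1} \, \Delta_2 s_{k+1}} \\
\eta_{4k} &= \frac{-1}{\Delta s_k \, \Delta s_{k+1} \, \Delta_2 s_{k+1}}
 + \frac{\Delta s_{k+1}}{\Delta s_k^3 \, \Delta_2 s_{k+1}}
 + \frac{\Delta s_{k-1}}{\Delta s_k^3 \, \Delta_2 s_k}
 - \frac{2}{\Delta s_k^3} \\
\zeta_{4k} &= \frac{-\Delta s_{k+1}}{\Delta s_k^3 \, \Delta_2 s_{k+1}}
 - \frac{\Delta s_{k-1}}{\Delta s_k^3 \, \Delta_2 s_k}
 + \frac{1}{\Delta s_k \, \Delta s_{k-1} \, \Delta_2 s_k}
 + \frac{2}{\Delta s_k^3} \\
\nu_{4k} &= \frac{-1}{\Delta s_{k-1} \, \Delta s_k \, \Delta_2 s_k}
\end{aligned}$$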
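The constants above can be checked numerically. The sketch below evaluates them
for one subinterval and applies them to four multiplier values sampled from a
quadratic function, for which the slope formula is exact; the resulting
coefficients should then be those of the quadratic itself. The function name and
the matrix layout are illustrative assumptions.

```python
import numpy as np

def bessel_constants(s_km1, s_k, s_kp1, s_kp2):
    """Constants of Appendix A for the subinterval [s_k, s_kp1].

    Returns a 3 x 4 array whose rows give c2, c3, c4 and whose columns
    multiply lambda_{k-1}, lambda_k, lambda_{k+1}, lambda_{k+2}."""
    a = s_k - s_km1        # Delta s_{k-1}
    h = s_kp1 - s_k        # Delta s_k
    b = s_kp2 - s_kp1      # Delta s_{k+1}
    d2 = s_kp1 - s_km1     # Delta_2 s_k
    d2p = s_kp2 - s_k      # Delta_2 s_{k+1}
    # row c2: zeta_2, eta_2, xi_2, (no lambda_{k+2} term)
    c2 = [-h / (a * d2), h / (a * d2) - a / (h * d2), a / (h * d2), 0.0]
    # row c3: nu_3, zeta_3, eta_3, xi_3
    c3 = [2.0 / (a * d2),
          -3.0 / h**2 - 2.0 / (a * d2) + 2.0 * a / (h**2 * d2) + b / (h**2 * d2p),
          3.0 / h**2 - 2.0 * a / (h**2 * d2) + 1.0 / (b * d2p) - b / (h**2 * d2p),
          -1.0 / (b * d2p)]
    # row c4: nu_4, zeta_4, eta_4, xi_4
    c4 = [-1.0 / (a * h * d2),
          2.0 / h**3 + 1.0 / (a * h * d2) - a / (h**3 * d2) - b / (h**3 * d2p),
          -2.0 / h**3 + a / (h**3 * d2) + b / (h**3 * d2p) - 1.0 / (h * b * d2p),
          1.0 / (h * b * d2p)]
    return np.array([c2, c3, c4])

pts = np.array([0.0, 1.0, 3.0, 4.0])     # s_{k-1}, s_k, s_{k+1}, s_{k+2}
M = bessel_constants(*pts)
print(M @ pts**2)                         # c2, c3, c4 for lambda(s) = s**2
```

For $\lambda(s) = s^2$ expanded about $s_k$, the expected coefficients are
$c_2 = 2 s_k$, $c_3 = 1$ and $c_4 = 0$, and each row of constants sums to zero so
that a constant traction is reproduced exactly.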